[x86] OptPass1LEA improvements (!324) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:double-lea into main Nov 09, 2022

Summary

This merge request streamlines and improves the LEA/LEA optimisations under OptPass1LEA to better reduce instruction counts and break dependency chains.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Pairs of dependent lea instructions are merged more often, which reduces instruction counts and breaks dependency chains.

Relevant logs and/or screenshots

(All comparisons done under -O4, x86_64-win64)

A large number of units receive improvements.

A couple of simple ones in the graph unit (the shll and addl instructions are converted from lea instructions in Pass 2) - before:

.section .text.n_graph_$$_floodfill$smallint$smallint$word,"ax"
	...
	leaq	1(%rdx),%rax
	leaq	(,%rax,8),%rsi
	...
.section .text.n_graph_$$_wincreate$$qword,"ax"
	...
	shll	$1,%eax
	addl	%esi,%eax
	...

After:

.section .text.n_graph_$$_floodfill$smallint$smallint$word,"ax"
	...
	leaq	8(,%rdx,8),%rsi
	...
.section .text.n_graph_$$_wincreate$$qword,"ax"
	...
	leal	(%esi,%eax,2),%eax
	...
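As a sanity check on the two graph examples above, here is a minimal Python sketch that models an x86 effective address as disp + base + index*scale and compares the two-instruction sequences with their merged forms. The helper names (lea, floodfill_before and so on) are invented for this illustration and are not part of the compiler. Only register values are modelled; flags, which shll and addl modify but lea does not, are ignored.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def lea(regs, disp=0, base=None, index=None, scale=1, mask=MASK64):
    """Hypothetical model of LEA: disp + base + index*scale, wrapped to the operand size."""
    value = disp
    if base is not None:
        value += regs[base]
    if index is not None:
        value += regs[index] * scale
    return value & mask

def floodfill_before(rdx):
    regs = {"rdx": rdx}
    regs["rax"] = lea(regs, disp=1, base="rdx")      # leaq 1(%rdx),%rax
    regs["rsi"] = lea(regs, index="rax", scale=8)    # leaq (,%rax,8),%rsi
    return regs["rsi"]

def floodfill_after(rdx):
    regs = {"rdx": rdx}
    return lea(regs, disp=8, index="rdx", scale=8)   # leaq 8(,%rdx,8),%rsi

def wincreate_before(eax, esi):
    eax = (eax << 1) & MASK32                        # shll $1,%eax
    eax = (eax + esi) & MASK32                       # addl %esi,%eax
    return eax

def wincreate_after(eax, esi):
    regs = {"eax": eax, "esi": esi}
    # leal (%esi,%eax,2),%eax
    return lea(regs, base="esi", index="eax", scale=2, mask=MASK32)
```

The displacement scales along with the register: (rdx + 1) * 8 = 8 + rdx * 8, which is why the merged instruction carries a displacement of 8 rather than 1.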

In the blowfish unit, an optimisation is performed where the instructions aren't adjacent - before:

.section .text.n_blowfish$_$tblowfishencryptstream_$__$$_flush,"ax"
	...
	movq	%rcx,%rbx
	cmpb	$0,48(%rcx)
	jna	.Lj117
	leaq	40(%rcx),%rdx
	movzbl	48(%rcx),%eax
	leaq	(%rdx,%rax,1),%rcx
	movzbl	48(%rbx),%eax
	movl	$8,%edx
	...

After:

.section .text.n_blowfish$_$tblowfishencryptstream_$__$$_flush,"ax"
	...
	movq	%rcx,%rbx
	cmpb	$0,48(%rcx)
	jna	.Lj117
	movzbl	48(%rcx),%eax
	leaq	40(%rcx,%rax,1),%rcx
	movzbl	48(%rbx),%eax
	movl	$8,%edx
	...

This is an interesting block because movzbl 48(%rbx),%eax is logically redundant: %eax already contains the value loaded by movzbl 48(%rcx),%eax, which executed while %rbx = %rcx. The feature in !191 can possibly remove the redundant instruction with some tuning.
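The blowfish case raises the question of when two non-adjacent lea instructions can safely be merged and when the first one can be dropped. The following Python sketch is one way to reason about it and is not the logic OptPass1LEA actually implements. Each instruction is a hypothetical (text, reads, writes) tuple, and a write to a 32-bit register is treated as a write to the full 64-bit register, since x86_64 zero-extends such writes.

```python
def can_fold(between, first_reads, first_dest):
    """True if nothing between the two LEAs writes a register that the
    first LEA reads, or the first LEA's destination."""
    clobbered = set()
    for _text, _reads, writes in between:
        clobbered |= set(writes)
    return not (clobbered & (set(first_reads) | {first_dest}))

def first_is_dead(following, first_dest):
    """True if first_dest is overwritten before it is read again in the
    instructions that follow the second LEA."""
    for _text, reads, writes in following:
        if first_dest in reads:
            return False
        if first_dest in writes:
            return True
    return False  # nothing known past the listing: conservatively keep it

# First LEA: leaq 40(%rcx),%rdx reads rcx and writes rdx.
BETWEEN = [("movzbl 48(%rcx),%eax", {"rcx"}, {"rax"})]
FOLLOWING = [
    ("movzbl 48(%rbx),%eax", {"rbx"}, {"rax"}),
    ("movl $8,%edx", set(), {"rdx"}),
]

foldable = can_fold(BETWEEN, {"rcx"}, "rdx")
removable = first_is_dead(FOLLOWING, "rdx")
```

In this model the intervening movzbl touches neither %rcx nor %rdx, so the merge is safe, and %rdx is overwritten by movl $8,%edx before anything reads it, so leaq 40(%rcx),%rdx can be removed.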

In the chmreader unit, the improved optimisations cause a better cascade that reduces the number of instructions in a block from 4 to 2 - before:

.section .text.n_chmreader$_$tarrayhelper$1$crc8d68c026_crc1ceae787_$__$$_median$hhkolmv67x4h,"ax"
	...
	movq	%r8,%rsi
	movq	%rcx,%rbx
	movq	%rdx,%rax
	shrq	$1,%rax
	shlq	$3,%rax
	leaq	(%rax,%rcx),%rdi
	leaq	-8(,%rdx,8),%rax
	leaq	(%rax,%rcx),%r12
	...

After:

.section .text.n_chmreader$_$tarrayhelper$1$crc8d68c026_crc1ceae787_$__$$_median$hhkolmv67x4h,"ax"
	...
	movq	%r8,%rsi
	movq	%rcx,%rbx
	movq	%rdx,%rax
	shrq	$1,%rax
	leaq	(%rcx,%rax,8),%rdi
	leaq	-8(%rcx,%rdx,8),%r12
	...
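The chmreader cascade can be followed by tracking each register as a linear expression of the original register values. This Python sketch is an illustration only: the dict representation and the helpers expr, add, scale and encodable_as_lea are assumptions made for this example, not compiler code. The encodability check is deliberately simplified and ignores details such as %rsp not being usable as an index.

```python
VALID_SCALES = {1, 2, 4, 8}

def expr(disp=0, **regs):
    """A linear expression: disp + sum(coeff * register)."""
    return {"disp": disp, **regs}

def add(a, b):
    out = dict(a)
    for key, coeff in b.items():
        out[key] = out.get(key, 0) + coeff
    return out

def scale(a, factor):
    return {key: coeff * factor for key, coeff in a.items()}

def encodable_as_lea(e):
    """Simplified check that e fits the form disp(base,index,scale)."""
    coeffs = sorted(c for k, c in e.items() if k != "disp" and c != 0)
    if not -2**31 <= e.get("disp", 0) < 2**31:
        return False
    if len(coeffs) <= 1:
        return not coeffs or coeffs[0] in VALID_SCALES
    if len(coeffs) == 2:
        # one register acts as the base (coefficient 1), the other as the index
        return coeffs[0] == 1 and coeffs[1] in VALID_SCALES
    return False

# "rax" stands for the value of %rax after shrq $1,%rax.
rax = scale(expr(rax=1), 8)                  # shlq $3,%rax
rdi = add(rax, expr(rcx=1))                  # leaq (%rax,%rcx),%rdi
rax = add(expr(-8), scale(expr(rdx=1), 8))   # leaq -8(,%rdx,8),%rax
r12 = add(rax, expr(rcx=1))                  # leaq (%rax,%rcx),%r12
```

Here rdi works out to rcx + 8*rax and r12 to -8 + rcx + 8*rdx. Both fit a single lea operand, which matches (%rcx,%rax,8) and -8(%rcx,%rdx,8) in the listing above.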

In the system unit, the optimisation is performed even though the intermediate register is still in use afterwards. The original instruction therefore isn't removed (this optimisation isn't performed under -Os), but the dependency chain is broken - before:

.section .text.n_system_$$_align$qword$qword$$qword,"ax"
	...
	movq	%rdx,%r8
	leaq	-1(%rdx),%rax
	leaq	(%rcx,%rax),%r9
	andq	%rax,%rdx

After:

.section .text.n_system_$$_align$qword$qword$$qword,"ax"
	...
	movq	%rdx,%r8
	leaq	-1(%rdx),%rax
	leaq	-1(%rcx,%rdx),%r9
	andq	%rax,%rdx

This one is also interesting from an optimisation standpoint because %r8 doesn't actually get used in the procedure and so movq %rdx,%r8 can be removed.
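To make the dependency-chain claim for the align example concrete, the Python sketch below computes how deep each instruction sits in the read-after-write chain. It assumes a deliberately crude model: every instruction has unit latency and only true dependencies count. The function chain_depths and the (text, reads, writes) tuples are hypothetical names introduced for this illustration.

```python
def chain_depths(instructions):
    """Return the dependency depth of each instruction, counting only
    read-after-write dependencies and assuming unit latency."""
    ready = {}   # register -> depth of the instruction that last wrote it
    depths = []
    for _text, reads, writes in instructions:
        depth = 1 + max((ready.get(r, 0) for r in reads), default=0)
        depths.append(depth)
        for w in writes:
            ready[w] = depth
    return depths

BEFORE = [
    ("movq %rdx,%r8",          {"rdx"},        {"r8"}),
    ("leaq -1(%rdx),%rax",     {"rdx"},        {"rax"}),
    ("leaq (%rcx,%rax),%r9",   {"rcx", "rax"}, {"r9"}),
    ("andq %rax,%rdx",         {"rax", "rdx"}, {"rdx"}),
]

AFTER = [
    ("movq %rdx,%r8",          {"rdx"},        {"r8"}),
    ("leaq -1(%rdx),%rax",     {"rdx"},        {"rax"}),
    ("leaq -1(%rcx,%rdx),%r9", {"rcx", "rdx"}, {"r9"}),
    ("andq %rax,%rdx",         {"rax", "rdx"}, {"rdx"}),
]

before_depths = chain_depths(BEFORE)
after_depths = chain_depths(AFTER)
```

In this model the instruction that writes %r9 moves from depth 2 to depth 1, because it no longer waits on %rax. The instruction count is unchanged, since andq still needs the result of leaq -1(%rdx),%rax.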
