[x86] Improved cross-jump optimisation in OptPass1MOV

Summary

This merge request significantly improves the cross-jump optimisation run within OptPass1MOV by allowing optimisations across labels in controlled circumstances - specifically, when a label's reference count matches the number of jumps to it that have recently been seen. This pattern generally corresponds to if..else blocks, so the optimisation fires very frequently.

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

More potential pipeline stalls and unnecessary MOV instructions are eliminated.

Additional Notes

Internally, GetNextInstructionUsingRegCond was upgraded to use an internal list that tracks each conditional jump it passes and, in one special case, an unconditional jump (normally it would break out upon running into one of these). Each time a jump is found, it notes the destination label and increments a counter for it. When a label is found, its reference count is compared against this counter; if they match, the compiler knows that, from its starting point (a MOV instruction for the particular register being tracked), the program cannot jump to this label from anywhere else, and so it can assume the tracked register still holds the same value. At this point, the label is removed from the tracking list. If the tracking list becomes completely empty as a result, the compiler knows that all code paths since the starting point have converged, it is no longer a cross-jump situation, and CrossJump can be set to False.
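As a rough illustration of the mechanism described above, here is a sketch in Python; the names and data structures are hypothetical, not the actual compiler source.

```python
# Hypothetical sketch of the label-tracking idea described above; the names
# and data structures are illustrative, not the actual compiler code.

def scan_for_convergence(instructions, label_refcounts):
    """Walk forward from a tracked MOV. Return True if, by the end of the
    scan, every code path since the start has converged (no cross-jump)."""
    seen_jumps = {}        # label -> number of jumps to it passed so far
    cross_jump = False     # analogous to the CrossJump flag
    for op, target in instructions:
        if op == 'jcc':    # conditional jump: note the destination label
            seen_jumps[target] = seen_jumps.get(target, 0) + 1
            cross_jump = True
        elif op == 'label':
            if target not in seen_jumps:
                return False   # label reachable from an untracked jump
            if seen_jumps[target] != label_refcounts[target]:
                return False   # some references are not accounted for
            del seen_jumps[target]   # every jump to it has been seen
            if not seen_jumps:
                cross_jump = False   # all paths have converged again
    return not cross_jump
```

For an if..else shape - two conditional jumps whose labels each have exactly one reference - the tracking list empties once both labels are passed, which is exactly the situation where CrossJump can be cleared.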

Despite the substantial additions, the compiler binary on x86_64-win64 grows by only 512 bytes under -O4, thanks to the optimisations doing their work.

The implementation is not as clean as I would have liked: performing these improved cross-jump optimisations as early as possible caused a lot of smaller optimisations to be missed, while moving them to Pass 2 caused more fundamental optimisations to be overlooked, such as changing "lea x(%reg),%reg" to "add x,%reg". The solution was to defer the advanced cross-jump optimisations (the ones involving the internal list) until a second iteration of Pass 1. Due to the set-up, simply setting Result to True in an individual optimisation routine was not possible, as this could cause an infinite loop if the current instruction wasn't actually modified, so a new optimisation flag named aoc_ForceNewIteration was introduced that tells the compiler to run Pass 1 again even if no changes were made. At the same time, a NotFirstIteration property was added to the TAOptObj class so the cross-jump optimisation knows whether it is on the first iteration of Pass 1 (it is False on the first iteration and True on subsequent ones).
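To make the control flow concrete, here is a sketch - in Python rather than the actual Pascal, with illustrative names - of how a force-new-iteration flag differs from simply reporting that an instruction was modified:

```python
# Illustrative sketch only: models the difference between "an instruction was
# modified" (Result = True) and "run Pass 1 again anyway" (the
# aoc_ForceNewIteration idea). Not the compiler's actual pass driver.

def run_pass1(blocks, optimise):
    """optimise(block, not_first) returns (modified, force_new_iteration).
    Returns the number of iterations Pass 1 ran for."""
    iterations = 0
    not_first = False          # analogous to the NotFirstIteration property
    while True:
        iterations += 1
        modified = False
        force_new_iteration = False
        for block in blocks:
            result, force = optimise(block, not_first)
            modified = modified or result
            force_new_iteration = force_new_iteration or force
        not_first = True
        # Stop only when nothing changed AND no routine asked for another
        # run. The flag must not be raised unconditionally on every
        # iteration, or Pass 1 would never terminate.
        if not modified and not force_new_iteration:
            break
    return iterations
```

The point of the flag is the first iteration: the deferred optimisations run nothing yet but still request a second pass, without falsely claiming the current instruction changed.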

Finally, as a minor aside, the temporary register variables in OptPass1MOV were renamed for consistency, clarity and maintainability. This was partly necessary for the improved cross-jump optimisations, while still keeping some of the advantages of temporary storage.

Tested and confirmed no regressions on i386-win32, x86_64-win64 and x86_64-linux.

Relevant logs and/or screenshots

Most of the changes are register renames intended to prevent pipeline stalls. Many are probably too far apart (and on the other side of conditional jumps etc.) to make any difference, but they're harmless enough and may open up single-line optimisations - for example, in the Classes unit (under x86_64-win64) - before:

.globl	CLASSES$_$TBYTESSTREAM_$__$$_REALLOC$INT64$$POINTER
CLASSES$_$TBYTESSTREAM_$__$$_REALLOC$INT64$$POINTER:
.seh_proc CLASSES$_$TBYTESSTREAM_$__$$_REALLOC$INT64$$POINTER
	...
	movq	%rcx,%rbx
	...
	(Lots of instructions, jumps and labels, but whose references are all accounted for)
	...
.Lj991:
	...
.Lj994:
	...
.Lj992:
	...
.Lj997:
	...
	leaq	48(%rbx),%rcx
	...

After:

.globl	CLASSES$_$TBYTESSTREAM_$__$$_REALLOC$INT64$$POINTER
CLASSES$_$TBYTESSTREAM_$__$$_REALLOC$INT64$$POINTER:
.seh_proc CLASSES$_$TBYTESSTREAM_$__$$_REALLOC$INT64$$POINTER
	...
	movq	%rcx,%rbx
	...
	(Lots of instructions, jumps and labels, but whose references are all accounted for)
	...
.Lj991:
	...
.Lj994:
	...
.Lj992:
	...
.Lj997:
	...
	addq	$48,%rcx
	...

Also from Classes - before:

.globl	CLASSES$_$TFPLIST_$__$$_ASSIGN$TFPLIST$TLISTASSIGNOP$TFPLIST
CLASSES$_$TFPLIST_$__$$_ASSIGN$TFPLIST$TLISTASSIGNOP$TFPLIST:
.seh_proc CLASSES$_$TFPLIST_$__$$_ASSIGN$TFPLIST$TLISTASSIGNOP$TFPLIST
	pushq	%rbx
.seh_pushreg %rbx
	pushq	%rdi
.seh_pushreg %rdi
	pushq	%rsi
.seh_pushreg %rsi
	leaq	-32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	movq	%r9,%rdi
	testl	%r8d,%r8d
	je	.Lj2475
	subl	$1,%r8d
	je	.Lj2477
	subl	$1,%r8d
	je	.Lj2479
	subl	$1,%r8d
	je	.Lj2480
	subl	$1,%r8d
	je	.Lj2476
	subl	$1,%r8d
	je	.Lj2478
	jmp	.Lj2474
	.balign 16,0x90
.Lj2475:
	movq	%rdi,%r8
	movq	%rsi,%rdx
	movq	%rbx,%rcx
	...

The cross-jump optimisation is able to look beyond "jmp .Lj2474" because the label that follows, .Lj2475, has only a single reference, and the optimiser already went past that reference at the "je .Lj2475" instruction, so it continues the search after the unconditional jump / label pair. The CrossJump Boolean variable is still set to True because it doesn't know what's happening with %rbx, %rsi and %rdi at .Lj2474 (where the unconditional jump branches) and on the other sides of all of the conditional jumps, so the original assignments can't be removed. Nevertheless, %rdi gets replaced with %r9, and both "movq %rsi,%rdx" and "movq %rbx,%rcx" get removed because the registers are guaranteed to be equal in value at that point. Hence... after:

.globl	CLASSES$_$TFPLIST_$__$$_ASSIGN$TFPLIST$TLISTASSIGNOP$TFPLIST
CLASSES$_$TFPLIST_$__$$_ASSIGN$TFPLIST$TLISTASSIGNOP$TFPLIST:
.seh_proc CLASSES$_$TFPLIST_$__$$_ASSIGN$TFPLIST$TLISTASSIGNOP$TFPLIST
	pushq	%rbx
.seh_pushreg %rbx
	pushq	%rdi
.seh_pushreg %rdi
	pushq	%rsi
.seh_pushreg %rsi
	leaq	-32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	movq	%r9,%rdi
	testl	%r8d,%r8d
	je	.Lj2475
	subl	$1,%r8d
	je	.Lj2477
	subl	$1,%r8d
	je	.Lj2479
	subl	$1,%r8d
	je	.Lj2480
	subl	$1,%r8d
	je	.Lj2476
	subl	$1,%r8d
	je	.Lj2478
	jmp	.Lj2474
	.balign 16,0x90
.Lj2475:
	movq	%r9,%r8
	...
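The special case of continuing past an unconditional jump / label pair can be sketched as follows (a hypothetical Python model with invented names, not the compiler's actual code):

```python
# Hypothetical model of the jmp/label special case: the scan may step over an
# unconditional jump when the label right after it has every reference
# accounted for, so control can only arrive there along paths already seen.

def search_continues_past(stream, i, refcounts, seen_jumps):
    """stream: list of (op, target) pairs; i: index of a candidate 'jmp'.
    seen_jumps: label -> number of jumps to it passed so far in this scan."""
    op, _ = stream[i]
    if op != 'jmp' or i + 1 >= len(stream):
        return False
    next_op, next_label = stream[i + 1]
    if next_op != 'label':
        return False
    # Safe only if every reference to the following label has already been
    # passed; otherwise control could reach it from an untracked location.
    return seen_jumps.get(next_label, 0) == refcounts[next_label]
```

In the .Lj2475 example above, the label's single reference was already passed at "je .Lj2475", so the search steps over "jmp .Lj2474" and continues at the label.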

In the same unit:

.globl	CLASSES$_$TBASICACTIONLINK_$_EXECUTE$TCOMPONENT$$BOOLEAN_$$_fin$00000833
CLASSES$_$TBASICACTIONLINK_$_EXECUTE$TCOMPONENT$$BOOLEAN_$$_fin$00000833:
.seh_proc CLASSES$_$TBASICACTIONLINK_$_EXECUTE$TCOMPONENT$$BOOLEAN_$$_fin$00000833
	pushq	%rbp
.seh_pushreg %rbp
	movq	%rcx,%rbp ; <-- The improved optimisation removes this line
.seh_endprologue
	movq	-16(%rcx),%rax
	cmpq	$0,24(%rax)
	je	.Lj5832
	movq	-16(%rcx),%rax
	movq	24(%rax),%rax
	movq	$0,96(%rax)
.Lj5832:
	popq	%rbp
	ret
.seh_endproc

In this situation, it's worth noting that %rbp ends up unused and a future optimisation could possibly attempt to remove the pushing and popping of %rbp and the SEH hints.

In the SysUtils unit, a pair of register writes for actual parameters gets removed since the registers already hold the required values - before:

.globl	SYSUTILS_$$_UPPERCASE$ANSISTRING$TLOCALEOPTIONS$$ANSISTRING
SYSUTILS_$$_UPPERCASE$ANSISTRING$TLOCALEOPTIONS$$ANSISTRING:
.seh_proc SYSUTILS_$$_UPPERCASE$ANSISTRING$TLOCALEOPTIONS$$ANSISTRING
	pushq	%rbx
.seh_pushreg %rbx
	pushq	%rsi
.seh_pushreg %rsi
	leaq	-40(%rsp),%rsp
.seh_stackalloc 40
.seh_endprologue
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	testl	%r8d,%r8d
	je	.Lj3642
	subl	$1,%r8d
	je	.Lj3643
	jmp	.Lj3641
	.balign 16,0x90
.Lj3642:
	movq	%rbx,%rcx
	movq	%rsi,%rdx
	call	SYSUTILS_$$_UPPERCASE$ANSISTRING$$ANSISTRING

After:

.globl	SYSUTILS_$$_UPPERCASE$ANSISTRING$TLOCALEOPTIONS$$ANSISTRING
SYSUTILS_$$_UPPERCASE$ANSISTRING$TLOCALEOPTIONS$$ANSISTRING:
.seh_proc SYSUTILS_$$_UPPERCASE$ANSISTRING$TLOCALEOPTIONS$$ANSISTRING
	pushq	%rbx
.seh_pushreg %rbx
	pushq	%rsi
.seh_pushreg %rsi
	leaq	-40(%rsp),%rsp
.seh_stackalloc 40
.seh_endprologue
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	testl	%r8d,%r8d
	je	.Lj3642
	subl	$1,%r8d
	je	.Lj3643
	jmp	.Lj3641
	.balign 16,0x90
.Lj3642:
	call	SYSUTILS_$$_UPPERCASE$ANSISTRING$$ANSISTRING

In the aasmcnst unit of the compiler, a jump chain in the tai_aggregatetypedconst.addvalue method is shortened and a redundant MOV is removed - before:

.section .text.n_aasmcnst$_$tai_aggregatetypedconst_$__$$_addvalue$tai_abstracttypedconst,"ax"
	.balign 16,0x90
.globl	AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_ADDVALUE$TAI_ABSTRACTTYPEDCONST
AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_ADDVALUE$TAI_ABSTRACTTYPEDCONST:
.seh_proc AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_ADDVALUE$TAI_ABSTRACTTYPEDCONST
	pushq	%rbx
.seh_pushreg %rbx
	pushq	%rsi
.seh_pushreg %rsi
	leaq	-40(%rsp),%rsp
.seh_stackalloc 40
.seh_endprologue
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	cmpb	$0,64(%rcx)
	jne	.Lj139
	cmpb	$0,48(%rdx)
	jne	.Lj141
	movq	56(%rdx),%rax
	cmpb	$4,32(%rax)
	jne	.Lj141
.Lj139:
	cmpb	$0,64(%rbx)
	jne	.Lj146
	movq	56(%rbx),%rax
	movq	8(%rax),%rax
	cmpl	$0,16(%rax)
	jng	.Lj146
	movq	%rbx,%rcx
	call	AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_CONVERT_TO_STRING
.Lj146:

After:

.globl	AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_ADDVALUE$TAI_ABSTRACTTYPEDCONST
AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_ADDVALUE$TAI_ABSTRACTTYPEDCONST:
.seh_proc AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_ADDVALUE$TAI_ABSTRACTTYPEDCONST
	pushq	%rbx
.seh_pushreg %rbx
	pushq	%rsi
.seh_pushreg %rsi
	leaq	-40(%rsp),%rsp
.seh_stackalloc 40
.seh_endprologue
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	cmpb	$0,64(%rcx)
	jne	.Lj146 ; <- After (*) gets changed, the "CMP/Jcc/@Lbl/CMP/Jcc -> CMP/Jcc, redirecting first jump" optimisation changes the destination.
	cmpb	$0,48(%rdx)
	jne	.Lj141
	movq	56(%rdx),%rax
	cmpb	$4,32(%rax)
	jne	.Lj141
	cmpb	$0,64(%rcx) ; (*) (see above)
	jne	.Lj146
	movq	56(%rcx),%rax
	movq	8(%rax),%rax
	cmpl	$0,16(%rax)
	jng	.Lj146
	call	AASMCNST$_$TAI_AGGREGATETYPEDCONST_$__$$_CONVERT_TO_STRING
.Lj146:

Also, for (*), a future optimisation could note that 64(%rcx) is the same reference as before, that %rcx has not changed value, and that there have been no writes to memory; since it's comparing against zero again, this comparison is redundant because it will produce the same result as the first one, and the jump will not be taken (as it wasn't taken the first time).
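That suggested future check might look something like this - a speculative Python sketch with made-up helper names and representation, not an existing compiler routine:

```python
# Speculative sketch of the redundant-compare test suggested above; the
# instruction representation and names are invented for illustration.

def compare_is_redundant(first_cmp, second_cmp, intervening):
    """first_cmp/second_cmp: (mnemonic, operands, address_register) tuples.
    intervening: one (writes_memory, registers_written) pair per instruction
    between the two compares. True means the second compare can be removed
    and the jump after it resolved the same way as the first."""
    if first_cmp != second_cmp:
        return False               # must be the exact same comparison
    _, _, base_reg = second_cmp
    for writes_memory, regs_written in intervening:
        if writes_memory:
            return False           # the memory operand may have changed
        if base_reg in regs_written:
            return False           # the address register was overwritten
    return True
```

Under these conditions the second "cmpb $0,64(%rcx)" sets the flags identically to the first, so "jne .Lj146" after it is known not taken and both instructions could go.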

I guess the compiler will become slightly faster! Before:

.globl	AASMCPU$_$TAICPU_$__$$_RESETPASS2
AASMCPU$_$TAICPU_$__$$_RESETPASS2:
	movq	%rcx,%rax
	movq	104(%rcx),%rdx
	testq	%rdx,%rdx
	je	.Lj642
	testb	$2,53(%rdx)
	je	.Lj642
	movq	$0,104(%rcx)
	movb	$0,122(%rcx)
.Lj642:
	movl	$-1,116(%rax)
	ret

After:

.globl	AASMCPU$_$TAICPU_$__$$_RESETPASS2
AASMCPU$_$TAICPU_$__$$_RESETPASS2:
	movq	104(%rcx),%rdx
	testq	%rdx,%rdx
	je	.Lj642
	testb	$2,53(%rdx)
	je	.Lj642
	movq	$0,104(%rcx)
	movb	$0,122(%rcx)
.Lj642:
	movl	$-1,116(%rcx)
	ret

A fun little one in SysUtils, where "movslq %r12d,%rax" becomes the shorter "cltq" (sign-extending %eax instead), evidently because the optimiser has determined that %eax still holds the same value as %r12d at that point - before:

.Lj383:
	...
	movl	%eax,%r12d
	testl	%eax,%eax
	je	.Lj392
	...
	je	.Lj394
	...
.Lj394:
	...
	movslq	%r12d,%rax
	...

After:

.Lj383:
	...
	movl	%eax,%r12d
	testl	%eax,%eax
	je	.Lj392
	...
	je	.Lj394
	...
.Lj394:
	...
	cltq
	...
