[x86 / Bug Fix] Fixed "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations being applied incorrectly if another jump follows (Fixes #40643) (!591) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:i40643 into main Feb 17, 2024

Summary

This merge request addresses #40643 (closed) where an optimisation in OptPass1CMP was applied incorrectly in some circumstances, but which manifested most frequently under -O1.

The reasons for this were two-fold:

The "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations did not verify if the flags were used after the conditional jump (e.g. because there's another conditional jump that follows it).
The "CMP/Jcc/CMP; removed superfluous CMP" optimisation did not allocate the flags between the jump and the removed comparison, thus causing the addition of flag-checking in the "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations to fail.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

In the example supplied in #40643 (closed), the compiler would generate bad code under -O1 and cause a condition to be evaluated incorrectly due to misapplication of the "Cmp1Jl2Cmp0Jle" optimisation.

What is the behavior after applying this patch?

"Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" should now always be applied correctly.

Additional Notes

Supplied as a separate commit, some minor optimisation boosts were given to the rest of OptPass1CMP that minimises unnecessary re-execution of the subroutine (using Include(OptsToCheck, aoc_ForceNewIteration) rather than Result := True if p itself wasn't changed), along with Prefetch calls in while-loops.

Relevant logs and/or screenshots

In the new webtbs/tw40643 test under x86_64-win64 with -O1 -OoNOPEEPHOLE (thus this code is good):

# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
	cmpw	$1,TC_$P$TW40643_$$_A(%rip)
	jl	.Lj3
	jmp	.Lj4
.Lj4:
	cmpw	$1,TC_$P$TW40643_$$_A(%rip)
	je	.Lj6
	jmp	.Lj7
.Lj6:
	movw	TC_$P$TW40643_$$_A(%rip),%ax
	cmpw	TC_$P$TW40643_$$_B(%rip),%ax
	je	.Lj8
	jmp	.Lj7
.Lj8:
	jmp	.Lj3
.Lj7:
	jmp	.Lj5
.Lj3:
	jmp	.Lj1
	.p2align 4,,10
	.p2align 3
.Lj5:
# [9] WriteLn('Fail');

Running the compiler under -O1 but with OptPass1CMP disabled so the jumps are cleaned up and optimised (code still good):

# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
	cmpl	$1,TC_$P$TW40643_$$_A(%rip)
	jl	.Lj1
	cmpl	$1,TC_$P$TW40643_$$_A(%rip)
	jne	.Lj5
	movl	TC_$P$TW40643_$$_A(%rip),%eax
	cmpl	U_$P$TW40643_$$_B(%rip),%eax
	je	.Lj1
	.p2align 4,,10
	.p2align 3
.Lj5:
# [9] WriteLn('Fail');

With OptPass1CMP enabled (code is now bad, thus the "Fail" branch is reached unless a <= 0):

# Peephole Optimization: Cmp1Jl2Cmp0Jle
# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
	cmpl	$0,TC_$P$TW40643_$$_A(%rip)
	jle	.Lj1
# Peephole Optimization: CMP/Jcc/CMP; removed superfluous CMP
# [9] WriteLn('Fail');

To explain why the block of code got completely removed, after "Cmp1Jl2Cmp0Jle" was applied, the compiler determined that jne .Lj5 was deterministic and would always branch (because if a = 0 and hence jne .Lj5 would not branch, it would have branched at jle .Lj1 instead), and thus turned it into jmp .Lj5 - in turn, this meant that everything between what is now an unconditional jump and the next label is unreachable code and so is stripped out. The next label happens to be .Lj5; because this is a zero-distance jump, jmp .Lj5 gets removed, and then .Lj5 itself because there's no longer any references to it.

With the fixed optimisations (code good again - "Cmp1Jl2Cmp0Jle" no longer applied... compare to the version where OptPass1CMP is disabled... the only difference now is the removal of the 2nd CMP instruction):

# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
	cmpl	$1,TC_$P$TW40643_$$_A(%rip)
	jl	.Lj1
# Peephole Optimization: CMP/Jcc/CMP; removed superfluous CMP
	jne	.Lj5
	movl	TC_$P$TW40643_$$_A(%rip),%eax
	cmpl	U_$P$TW40643_$$_B(%rip),%eax
	je	.Lj1
	.p2align 4,,10
	.p2align 3
.Lj5:
# [9] WriteLn('Fail');

Along with the test and the example in the bug report, there was one example of bad code under -O4 in the packages. In the gtk unit (x86_64-win64) - before:

	.file "gtk.pp"
# Begin asmlist al_procedures

.section .text.n_gtk_$$_gtk_check_version$longint$longint$longint$$boolean,"ax"
	.balign 16,0x90
.globl	GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN
GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN:
.Lc2:
	testl	%ecx,%ecx
	jle	.Lj5
	seteb	%r9b
	...

After:

	.file "gtk.pp"
# Begin asmlist al_procedures

.section .text.n_gtk_$$_gtk_check_version$longint$longint$longint$$boolean,"ax"
	.balign 16,0x90
.globl	GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN
GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN:
.Lc2:
	cmpl	$1,%ecx
	jl	.Lj5
	seteb	%r9b
	...

After the fix, the "Cmp1Jl2Cmp0Jle" optimisation is not applied, which caused %r9b to be set when %ecx is 0 instead of 1, thus causing unexpected behaviour in the unit.

Edited Feb 17, 2024 by J. Gareth "Kit" Moreton

[x86 / Bug Fix] Fixed "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations being applied incorrectly if another jump follows (Fixes #40643)