[x86 / Bug Fix] Fixed "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations being applied incorrectly if another jump follows (Fixes #40643)
Summary
This merge request addresses #40643 (closed) where an optimisation in OptPass1CMP
was applied incorrectly in some circumstances, but which manifested most frequently under -O1.
The reasons for this were two-fold:
- The "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations did not verify if the flags were used after the conditional jump (e.g. because there's another conditional jump that follows it).
- The "CMP/Jcc/CMP; removed superfluous CMP" optimisation did not allocate the flags between the jump and the removed comparison, thus causing the addition of flag-checking in the "Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" optimisations to fail.
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
In the example supplied in #40643 (closed), the compiler would generate bad code under -O1 and cause a condition to be evaluated incorrectly due to misapplication of the "Cmp1Jl2Cmp0Jle" optimisation.
What is the behavior after applying this patch?
"Cmp1Jl2Cmp0Jle" and "CmpJe2NegJo" should now always be applied correctly.
Additional Notes
Supplied as a separate commit, some minor optimisation boosts were given to the rest of OptPass1CMP
that minimises unnecessary re-execution of the subroutine (using Include(OptsToCheck, aoc_ForceNewIteration)
rather than Result := True
if p
itself wasn't changed), along with Prefetch
calls in while-loops.
Relevant logs and/or screenshots
In the new webtbs/tw40643
test under x86_64-win64 with -O1 -OoNOPEEPHOLE
(thus this code is good):
# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
cmpw $1,TC_$P$TW40643_$$_A(%rip)
jl .Lj3
jmp .Lj4
.Lj4:
cmpw $1,TC_$P$TW40643_$$_A(%rip)
je .Lj6
jmp .Lj7
.Lj6:
movw TC_$P$TW40643_$$_A(%rip),%ax
cmpw TC_$P$TW40643_$$_B(%rip),%ax
je .Lj8
jmp .Lj7
.Lj8:
jmp .Lj3
.Lj7:
jmp .Lj5
.Lj3:
jmp .Lj1
.p2align 4,,10
.p2align 3
.Lj5:
# [9] WriteLn('Fail');
Running the compiler under -O1
but with OptPass1CMP
disabled so the jumps are cleaned up and optimised (code still good):
# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
cmpl $1,TC_$P$TW40643_$$_A(%rip)
jl .Lj1
cmpl $1,TC_$P$TW40643_$$_A(%rip)
jne .Lj5
movl TC_$P$TW40643_$$_A(%rip),%eax
cmpl U_$P$TW40643_$$_B(%rip),%eax
je .Lj1
.p2align 4,,10
.p2align 3
.Lj5:
# [9] WriteLn('Fail');
With OptPass1CMP
enabled (code is now bad, thus the "Fail" branch is reached unless a <= 0):
# Peephole Optimization: Cmp1Jl2Cmp0Jle
# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
cmpl $0,TC_$P$TW40643_$$_A(%rip)
jle .Lj1
# Peephole Optimization: CMP/Jcc/CMP; removed superfluous CMP
# [9] WriteLn('Fail');
To explain why the block of code got completely removed, after "Cmp1Jl2Cmp0Jle" was applied, the compiler determined that jne .Lj5
was deterministic and would always branch (because if a = 0 and hence jne .Lj5
would not branch, it would have branched at jle .Lj1
instead), and thus turned it into jmp .Lj5
- in turn, this meant that everything between what is now an unconditional jump and the next label is unreachable code and so is stripped out. The next label happens to be .Lj5
; because this is a zero-distance jump, jmp .Lj5
gets removed, and then .Lj5
itself because there's no longer any references to it.
With the fixed optimisations (code good again - "Cmp1Jl2Cmp0Jle" no longer applied... compare to the version where OptPass1CMP
is disabled... the only difference now is the removal of the 2nd CMP instruction):
# [7] if (a < 1) or ((a = 1) and (a = b)) then Exit;
cmpl $1,TC_$P$TW40643_$$_A(%rip)
jl .Lj1
# Peephole Optimization: CMP/Jcc/CMP; removed superfluous CMP
jne .Lj5
movl TC_$P$TW40643_$$_A(%rip),%eax
cmpl U_$P$TW40643_$$_B(%rip),%eax
je .Lj1
.p2align 4,,10
.p2align 3
.Lj5:
# [9] WriteLn('Fail');
Along with the test and the example in the bug report, there was one example of bad code under -O4 in the packages. In the gtk
unit (x86_64-win64) - before:
.file "gtk.pp"
# Begin asmlist al_procedures
.section .text.n_gtk_$$_gtk_check_version$longint$longint$longint$$boolean,"ax"
.balign 16,0x90
.globl GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN
GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN:
.Lc2:
testl %ecx,%ecx
jle .Lj5
seteb %r9b
...
After:
.file "gtk.pp"
# Begin asmlist al_procedures
.section .text.n_gtk_$$_gtk_check_version$longint$longint$longint$$boolean,"ax"
.balign 16,0x90
.globl GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN
GTK_$$_GTK_CHECK_VERSION$LONGINT$LONGINT$LONGINT$$BOOLEAN:
.Lc2:
cmpl $1,%ecx
jl .Lj5
seteb %r9b
...
After the fix, the "Cmp1Jl2Cmp0Jle" optimisation is not applied, which caused %r9b
to be set when %ecx
is 0 instead of 1, thus causing unexpected behaviour in the unit.