[x86] CMOV optimisation overhaul
Summary
This merge request overhauls the CMOV optimisations present in OptPass2Jcc. Though it started as a straight refactor so the code can more easily be ported to AArch64 in the future, some unexpected improvements appeared along the way.
- CMOV block creation "outlined" into a separate object that's created only when needed.
- Duplicated or similar code greatly reduced to improve maintenance and reduce bugs.
- New combinations 'discovered' and categorised (hand-written sketches of a couple of these appear after this list):
  - (EXISTS) "Simple": Jcc .Lbl1 / MOV / .Lbl1:. Optimised to CMOV(~c).
  - (EXISTS) "Detour": Jcc .Lbl1 / MOV / JMP .Lbl1. Optimised to CMOV(~c) / JMP .Lbl1.
  - (EXISTS) "Double": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / .Lbl2:. Optimised to CMOV(~c) / CMOVcc, with some optimisations if the two CMOVs make a complementary pair (opposite conditions but writing to the same register).
  - "Branching": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1:. If it's not the classic "Double", this can be converted to CMOV(~c) / J(~c) .Lbl2.
  - "Double Branching (Same)": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / JMP .Lbl2. In this case, .Lbl2 appears elsewhere, but there's an extra jump at the end of the second block that goes to the same location. The end result is CMOV(~c) / CMOVcc / JMP .Lbl2, with some optimisations if the two CMOVs make a complementary pair.
  - "Double Branching (Different)": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / JMP .Lbl3. Like the above, but the two jumps go to different places. This is mostly treated the same as the single "Branching" type, although complementary CMOV optimisations take place as well; otherwise, what follows .Lbl1 is largely left untouched.
  - "Double Second Branching": Jcc .Lbl1 / MOV / JMP .Lbl2 / .Lbl1: / MOV / JMP .Lbl3 / .Lbl2:. Like "Double Branching (Different)", only .Lbl2 appears after the second block. Most of the same optimisations as "Double" and "Double Branching (Same)" are performed, only the final jump is made conditional. The end result is CMOV(~c) / CMOVcc / Jcc .Lbl3 / .Lbl2:.
- OptPass2CMP and OptPass2TEST, along with a secondary TryCmpCMovOpts routine, have been introduced to optimise complementary CMOV pairs and reposition MOV instructions that are otherwise missed in a few circumstances, usually as a result of a "Branching" block being created and the leftover Jcc making a new CMOV block that, for whatever reason, can't easily be merged to make one of the double block types.
- Because the FLAGS register has to be allocated over much of the CMOV blocks, some mov $0,%reg instructions don't get optimised to xor %reg,%reg even though it's actually safe to do so. As such, PostPeepholeOptMov has been adapted so that, if the FLAGS are in use, it searches ahead to see if the flags get overwritten (e.g. by a CMP instruction) and so can be safely scrambled (the second sketch below illustrates this).

For all the double types, .Lbl1 can only have a single reference.
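To make the taxonomy concrete, here is a hand-written "Simple" block (hypothetical registers and labels, not actual compiler output):
cmpl $0,%ebx
jne .Lbl1
movl %edx,%eax
.Lbl1:
Under the "Simple" rule this collapses to a single CMOV with the inverted condition:
cmpl $0,%ebx
cmovel %edx,%eax
A "Double" block whose two MOVs write to the same register:
cmpl $0,%ebx
jne .Lbl1
movl %edx,%eax
jmp .Lbl2
.Lbl1:
movl %ecx,%eax
.Lbl2:
becomes CMOV(~c) / CMOVcc:
cmpl $0,%ebx
cmovel %edx,%eax
cmovnel %ecx,%eax
Since exactly one of the complementary pair always fires, one possible further simplification (sketched here; the exact form the optimiser uses may differ) is to demote one CMOV to a plain MOV and hoist it above the comparison:
movl %ecx,%eax
cmpl $0,%ebx
cmovel %edx,%eax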
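Similarly, a hand-written illustration of the PostPeepholeOptMov situation (again not compiler output; registers and constants are made up):
cmpl %edx,%ecx
cmovel %esi,%eax
movl $0,%ebx # FLAGS allocated across the CMOV block, so this was previously left alone...
cmpl $100,%edi # ...but the flags are completely rewritten here before the next read
cmovgl %edx,%ebx
Since the CMP overwrites the flags before they are read again, the zeroing MOV can now safely become the shorter, flag-scrambling XOR:
cmpl %edx,%ecx
cmovel %esi,%eax
xorl %ebx,%ebx # scrambles FLAGS, but they are dead at this point
cmpl $100,%edi
cmovgl %edx,%ebx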
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
CMOV block optimisations are now applied more widely, and the compiler source has been refactored so that this particular optimisation is easier to maintain.
Development Notes
- During initial testing, there was a mysterious failure on i386-linux, namely in the test tintfcdecl2.pp (odd, because the peephole optimizer doesn't run for that test); it turned out to be a code generation error. The bug that caused it, an error in the new TryCmpCMovOpts routine, was fixed as a result. It goes to show though that it is very hard to track all the effects of new compiler features, since they can combine with existing optimisations in unexpected ways or otherwise remain disguised until a very particular code sequence unveils them.
Relevant logs and/or screenshots
Well, where to begin. Virtually every source file sees some kind of improvement, both in speed and size, the latter mostly from the stripping of unnecessary alignment fields. For a simple introduction, in the ServiceManager unit under x86_64-win64, -O4 - before:
.section .text.n_servicemanager$_$tservicemanager_$_configservice$qword$tservicedescriptor_$$_stopchar$hiub462steen,"ax"
.balign 16,0x90
.globl SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN
SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN:
.Lc322:
cmpq $0,(%rdx)
jne .Lj442
xorl %eax,%eax
ret
.p2align 4,,10
.p2align 3
.Lj442:
movq (%rdx),%rdx
testq %rdx,%rdx
jne .Lj444
leaq FPC_EMPTYCHAR(%rip),%rdx
.Lj444:
movq %rdx,%rax
.Lc323:
ret
After:
.section .text.n_servicemanager$_$tservicemanager_$_configservice$qword$tservicedescriptor_$$_stopchar$hiub462steen,"ax"
.balign 16,0x90
.globl SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN
SERVICEMANAGER$_$TSERVICEMANAGER_$_CONFIGSERVICE$QWORD$TSERVICEDESCRIPTOR_$$_STOPCHAR$hiuB462sTeEN:
.Lc322:
xorl %ecx,%ecx
cmpq $0,(%rdx)
cmoveq %rcx,%rax
je .Lj443
movq (%rdx),%rdx
testq %rdx,%rdx
jne .Lj444
leaq FPC_EMPTYCHAR(%rip),%rdx
.Lj444:
movq %rdx,%rax
.Lj443:
.Lc323:
ret
This could actually be improved further with the right checks, since jne .Lj444 will always branch (the value at (%rdx) cannot be zero at that point, because je .Lj443 did not branch). Following the flow of logic, the theoretical best assembly language for this would be:
xorl %eax,%eax
cmpq $0,(%rdx)
cmovneq (%rdx),%rax
ret
Though it would get optimised out in this instance, there is potential to support LEA instructions in a limited capacity in CMOV blocks, in a similar way to how constants are handled.
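For illustration, since LEA has no side effects, the .Lj444 fragment above could in principle have the LEA evaluated into a spare register ahead of the comparison, much as constants are handled now (a sketch of hoped-for output, assuming %rcx is free, not something the compiler currently produces):
leaq FPC_EMPTYCHAR(%rip),%rcx # harmless to execute unconditionally
testq %rdx,%rdx
cmoveq %rcx,%rdx # only commit the address if %rdx was zero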
In the weditor unit, showcasing a "double branching (different)" style block - before:
call *232(%rax)
cmpb $0,32(%rsp)
jne .Lj504
xorl %ecx,%ecx
jmp .Lj505
.p2align 4,,10
.p2align 3
.Lj504:
xorl %ecx,%ecx
xorl %r8d,%r8d
jmp .Lj507
After:
call *232(%rax)
xorl %ecx,%ecx
cmpb $0,32(%rsp)
je .Lj505
xorl %ecx,%ecx
xorl %r8d,%r8d
jmp .Lj507
There is an inefficiency in that xorl %ecx,%ecx is executed twice, but this is not due to this optimisation - indeed, in the original Pascal code:
...
if S='' then
CP:=0
else
begin
CP:=0; RX:=0;
...
CP corresponds to %ecx. Nevertheless, new peephole optimisations or even node-level optimisations should be able to pick this up.
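For reference, a pass that tracked %ecx being already zero when the branch is taken could drop the duplicate and produce something like this (a hand-written sketch, not current output):
call *232(%rax)
xorl %ecx,%ecx # now covers both paths
cmpb $0,32(%rsp)
je .Lj505
xorl %r8d,%r8d # the duplicate xorl %ecx,%ecx is gone
jmp .Lj507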
In base64 - before:
...
movq %rcx,%rbx
cmpq $-1,32(%rcx)
je .Lj37
movq 32(%rcx),%r12
jmp .Lj34
.p2align 4,,10
.p2align 3
.Lj37:
...
After:
...
movq %rcx,%rbx
cmpq $-1,32(%rcx)
cmovneq 32(%rcx),%r12
jne .Lj34
...
In the system unit - before:
...
.Lj309:
movzbl (%rcx),%r9d
movzbl (%rdx),%r11d
subq %r11,%r9
movq %r9,%r10
testq %r9,%r9
jnl .Lj313
movq $-1,%rax
ret
.p2align 4,,10
.p2align 3
.Lj313:
testq %r9,%r9
jng .Lj316
movl $1,%eax
ret
.p2align 4,,10
.p2align 3
.Lj316:
...
.Lc211:
ret
After:
...
.Lj309:
movzbl (%rcx),%r9d
movzbl (%rdx),%r11d
subq %r11,%r9
movq %r9,%r10
movq $-1,%r11
testq %r9,%r9
cmovlq %r11,%rax
jl .Lj307
movl $1,%r11d
testq %r9,%r9
cmovgq %r11,%rax
jg .Lj307
...
.Lj307:
.Lc211:
ret
(the second test instruction could also be removed - something for later!)
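Concretely, because MOV, CMOVcc and Jcc all leave the FLAGS untouched, the result of the first testq %r9,%r9 is still valid at the second one, so the block could in principle shrink to (hand-written sketch):
movq $-1,%r11
testq %r9,%r9 # these FLAGS survive the CMOV, Jcc and MOV below
cmovlq %r11,%rax
jl .Lj307
movl $1,%r11d
cmovgq %r11,%rax
jg .Lj307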
Also in the system unit - before:
.seh_proc SYSTEM$_$STR_REAL$SMALLINT$SMALLINT$DOUBLE$TREAL_TYPE$OPENSTRING_$$_GEN_DIGITS_64$hUXTohxUvDOC
...
cmpq $1000000000,%r9
jnb .Lj1771
xorl %r9d,%r9d
xorl %r12d,%r12d
movl %ecx,%edi
jmp .Lj1772
.p2align 4,,10
.p2align 3
.Lj1771:
...
After:
.seh_proc SYSTEM$_$STR_REAL$SMALLINT$SMALLINT$DOUBLE$TREAL_TYPE$OPENSTRING_$$_GEN_DIGITS_64$hUXTohxUvDOC
...
xorl %eax,%eax
cmpq $1000000000,%r9
cmovbl %eax,%r9d
cmovbl %eax,%r12d
cmovbl %ecx,%edi
jb .Lj1772
...
In aasmcpu, before:
...
.Lj385:
cmpw $118,92(%rbx)
jne .Lj391
movw $116,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj391:
cmpw $116,92(%rbx)
jne .Lj394
movw $118,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj394:
cmpw $59,92(%rbx)
jne .Lj397
movw $57,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj397:
cmpw $57,92(%rbx)
jne .Lj400
movw $59,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj400:
cmpw $119,92(%rbx)
jne .Lj403
movw $117,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj403:
cmpw $117,92(%rbx)
jne .Lj406
movw $119,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj406:
cmpw $60,92(%rbx)
jne .Lj409
movw $58,%si
jmp .Lj389
.p2align 4,,10
.p2align 3
.Lj409:
...
After:
...
.Lj385:
movw $116,%ax
cmpw $118,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $118,%ax
cmpw $116,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $57,%ax
cmpw $59,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $59,%ax
cmpw $57,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $117,%ax
cmpw $119,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $119,%ax
cmpw $117,92(%rbx)
cmovew %ax,%si
je .Lj389
movw $58,%ax
cmpw $60,92(%rbx)
cmovew %ax,%si
je .Lj389
...