[x86] Inefficient CMOVcc/Jcc pairs
Summary
Special thanks to @runewalsh over at !610 (merged) for this one!
This merge request aims to correct some overzealous CMOV optimisations that may not be obvious until after the fact. If a "branching" block is made where CMOVcc is followed by Jcc with the same condition, and the register set by CMOVcc is deallocated after the jump (i.e. on the fall-through path), then the CMOVcc instruction can be changed into a regular MOV instruction and potentially moved before the comparison instruction for faster execution speed and smaller code size.
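As a schematic illustration (placeholder registers and label, not actual compiler output): if the register written by CMOVcc is dead on the fall-through path, the conditional move only matters when the jump is taken, so it can become an unconditional MOV hoisted above the comparison:
# hypothetical input
cmpl $0,%edx
cmovnel %ecx,%eax # %eax assumed dead on the fall-through path
jne .Lsomewhere
# becomes
movl %ecx,%eax # unconditional; harmless when the jump is not taken
cmpl $0,%edx
jne .Lsomewhere # CMP and Jcc now adjacent and can macro-fuse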
There is also a cleanup of OptPass2MOV and an additional MOV/CMP/MOV optimisation that attempts to rearrange registers so more of the MOVs can be placed before the comparison, thus increasing the chance of macro-fusing CMP with a subsequent Jcc. The conditions for this optimisation are quite strict (they require that another MOV or Jcc instruction follow) in order to best offset the performance loss from the potential pipeline stall introduced between the first MOV and CMP by increasing the chances of macro-fusion between CMP and Jcc.
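A minimal sketch of the idea (hypothetical registers and constant, assuming %edx is not used on the fall-through path): the comparison is redirected to an existing copy of the compared value, so the MOV that clobbers the original register can move above the CMP:
movq %rdx,%rcx # a copy of the compared value already exists
cmpq $100,%rdx
cmovbl %eax,%edx
jb .Ldone
# becomes
movq %rdx,%rcx
movl %eax,%edx # CMOV converted to MOV and hoisted, as above
cmpq $100,%rcx # comparison redirected to the copy in %rcx
jb .Ldone # CMP and Jcc left adjacent for macro-fusion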
Additional Notes
Unfortunately the most common case of this - setting the function result and exiting (e.g. via Exit(0);) - tends not to be optimised, because the result register is frequently allocated across the entire procedure and so is seen to still be in use after the jump, even though it may get overwritten later.
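For illustration, a hypothetical sequence of the kind that currently slips through (register names and label are mine): %eax holds the function result and is marked allocated for the whole routine, so the liveness check sees it as still in use past the jump:
xorl %ecx,%ecx
testb $1,%dl
cmovnel %ecx,%eax # could become a plain MOV hoisted above the TEST...
jne .Lepilogue # ...but %eax still appears live after this jump
movl $1,%eax # even though the fall-through path overwrites the result
.Lepilogue: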
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Some CMOVcc instructions are changed back into MOV instructions to increase performance.
Relevant logs and/or screenshots
In aasmcpu (x86_64-win64, -O4), the most common example with a constant is shown - before:
...
movl $2,%eax
testb $2,32(%rsp)
cmovnel %eax,%edx
jne .Lj572
movzbl 2(%rsi),%edx
.Lj572:
...
After:
...
movl $2,%edx
testb $2,32(%rsp)
jne .Lj572
movzbl 2(%rsi),%edx
.Lj572:
...
The system unit shows a register-only example - before:
.Lj1527:
...
cmpw %ax,%si
cmovgw %ax,%si
testw %r15w,%r15w
cmovlw %ax,%r13w
jl .Lj1537
...
After:
.Lj1527:
...
cmpw %ax,%si
cmovgw %ax,%si
movw %ax,%r13w
testw %r15w,%r15w
jl .Lj1537
...
Also in the system unit, the MOV/CMP/MOV optimisation in OptPass2MOV comes into play - before:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movq %r9,%rcx
xorl %eax,%eax
cmpq $1000000000,%r9
cmovbl %eax,%r9d
cmovbl %eax,%r12d
cmovbl %ecx,%edi
jb .Lj1743
...
After:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movq %r9,%rcx
xorl %r9d,%r9d
xorl %r12d,%r12d
movl %ecx,%edi
cmpq $1000000000,%rcx
jb .Lj1743
...
In the same procedure in the system unit, a register and a constant CMOV get optimised together - before:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movl %ecx,%edi
xorl %ecx,%ecx
cmpq $1000000000,%rdx
cmovbl %ecx,%r9d
cmovbl %edx,%r12d
jb .Lj1743
...
After:
.section .text.n_system$_$str_real$smallint$smallint$double$treal_type$openstring_$$_gen_digits_64$huxtohxuvdoc,"ax"
...
movl %ecx,%edi
xorl %r9d,%r9d
movl %edx,%r12d
cmpq $1000000000,%rdx
jb .Lj1743
...
The refactor of OptPass2MOV also fixed a case of inefficient optimisation in rgobj, where a 64-bit average of two 32-bit values was being computed with unnecessary 64-bit arithmetic - before:
...
.Lj479:
movl %esi,%eax
movl %ebx,%edx
addq %rdx,%rax
shrq $1,%rax
movl %eax,%ebp
...
After:
...
.Lj479:
movl %esi,%eax
addl %ebx,%eax
rcrl $1,%eax
movl %eax,%ebp
...
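For clarity, the rewritten sequence annotated (comments are mine): the carry from the 32-bit ADD holds bit 32 of the full sum, and RCR rotates it back into the top bit, so no precision is lost:
movl %esi,%eax # eax := esi
addl %ebx,%eax # 32-bit add; the carry flag now holds bit 32 of the sum
rcrl $1,%eax # rotate the carry into the MSB: eax := (esi + ebx) >> 1
movl %eax,%ebp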
Future Work
There was a noted example where performance is lost slightly in SysUtils - before:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
xorl %ecx,%ecx
orb %dl,%al
cmovneq %rcx,%rax
jne .Lj11254
...
After:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
orb %dl,%al
testb %al,%al
movl $0,%eax
jne .Lj11254
...
Here, %dl is not used after the OR instruction (%eax is the function result, and the jump goes to the function epilogue), and one potential optimisation, once the correct conditions are properly worked out, is to reverse the operands on the OR instruction and change the TEST instruction to use %dl instead, thus becoming:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
orb %al,%dl
testb %dl,%dl
movl $0,%eax
jne .Lj11254
...
This can then be further optimised to:
.section .text.n_sysutils_$$_datetimetofiledate$tdatetime$$int64,"ax"
...
setab %dl
orb %al,%dl
xorl %eax,%eax
testb %dl,%dl
jne .Lj11254
...
The performance is then truly faster: the MOV 0, now positioned before the TEST, becomes an XOR again and can execute in parallel with the OR instruction due to register renaming, while the TEST and JNE instructions can undergo macro-fusion. (This is still not perfect, as the TEST instruction doesn't get removed by the post-peephole stage, but the cycle count is 3, compared to about 4 both in trunk and with this patch, and the actual machine code is smaller.)