[x86] Refactoring and extensions of CMOV optimisations
Summary
This merge request (which has been on and off for a few months now, hence the internal error numbers) rewrites the CMOV optimisations in OptPass2Jcc for x86 platforms and adds some new features, including the ability to optimise MOVs that write immediates to registers in certain situations.
More specifically:
- Existing CMOV optimisations in OptPass2Jcc have been refactored, mainly because it was proving difficult to remember what each local variable was set to (they now have more meaningful names, such as hp_jump rather than hp4). As a side-effect of the refactor, alignment hints are now properly removed along with the labels during the CMOV optimisations.
- OptPass2Jcc can now deal with Jcc/MOV/.lbl combinations where the MOV writes an immediate (constant) to a register, by making use of the fairly new GetIntRegisterBetween function. This optimisation is not performed under -Os because it creates additional instructions and hence increases code size.
- As part of the CMOV optimisations, the compiler will now pick up when two opposing CMOVs (with opposite conditions) write to the same register, and will try to ensure the first one is a plain MOV positioned as early as possible in order to improve performance.
- As part of the above two points, the compiler will attempt to move the MOV instructions to before the CMP or TEST instruction (if one exists) that appears prior to the Jcc instruction. It will also try to go further if something like SUB or AND appears before that instruction, thus ensuring that optimisations performed later by PostPeepholeOptTestOr are not lost.
- A new TrySwapMovOp method has been introduced to facilitate the above, and TrySwapMovCmp has been rewritten to use it. At the same time, a bug was fixed where MOVSS was checked instead of MOVSD when determining whether an instruction is the 0-operand version (i.e. REP MOVSD) or the 2-operand version (the SSE instruction).
- IsRefSafe now returns True if the reference type is addr_full, although i386 does not currently permit addr_full references in CMOV optimisations, because these sometimes get turned into immediates (usually because the reference points to a small global constant) and raise assembler errors when they appear in CMOV instructions. This will be subject to further research.
- OptPass1Cmp now detects CMP/JE/MOV/.lbl and CMP/JNE/MOV/.lbl combinations where the operands of the CMP and MOV instructions are identical (possibly in reverse order). When found, only the MOV is kept in the JE case, and everything is removed (unless the label has other references) in the JNE case, because the end result is that the two operands will always be identical afterwards. The JE case seems to be popular in property setters.
- Even if IsRefSafe returns False, if a reference in a MOV instruction appears exactly as it does in the most recent comparison (the one that set the current condition flags), then CMOV is permitted, because if the reference were invalid, a segmentation fault would already have occurred.
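At the source level, the Jcc/MOV/.lbl-with-immediate transformation amounts to the following (a minimal C sketch of the semantics; the function and variable names are illustrative and not from the compiler source):

```c
#include <stdint.h>

/* Branchy form the optimiser starts from:
   cmp/jne over a MOV that writes an immediate. */
int16_t set_branchy(int16_t field, int16_t si)
{
    if (field == 58)
        si = 60;            /* the guarded MOV with an immediate */
    return si;
}

/* Branchless form after the optimisation: the immediate is loaded into a
   scratch register unconditionally (and can be hoisted above the compare,
   since it does not touch the flags), then conditionally moved. */
int16_t set_branchless(int16_t field, int16_t si)
{
    int16_t tmp = 60;           /* movw $60,%ax */
    int equal = (field == 58);  /* cmpw $58,... */
    return equal ? tmp : si;    /* cmovew %ax,%si */
}
```

The extra scratch load is why the optimisation is skipped under -Os: it trades a slightly larger sequence for the removal of a branch.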
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
More opportunities for removing conditional branches are found, converting them to CMOV sequences.
Relevant logs and/or screenshots
An example of the CMOV optimisation working successfully with an immediate in aasmcpu.s (under x86_64-win64 -O4) - before:
...
.Lj400:
cmpw $58,92(%rbx)
jne .Lj380
movw $60,%si
.Lj380:
...
After:
...
.Lj400:
movw $60,%ax
cmpw $58,92(%rbx)
cmovew %ax,%si
.Lj380:
...
A nice example in classes.s where two variables are conditionally set to zero, using the same temporary register - before:
...
.Lj346:
...
cmpq %rax,%r9
jne .Lj352
xorl %eax,%eax
xorl %ecx,%ecx
.Lj352:
...
After:
...
.Lj346:
...
xorl %ebx,%ebx
cmpq %rax,%r9
cmoveq %rbx,%rax
cmoveq %rbx,%rcx
.Lj352:
...
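Semantically, both variables hang off a single comparison, so one zeroed scratch register can feed both conditional moves. A C sketch of the before/after behaviour (illustrative names, not from the compiler source):

```c
#include <stdint.h>

/* Before: one branch guards two zeroing stores
   (cmpq / jne jumps over the two XORs when not equal). */
void zero_branchy(uint64_t a, uint64_t b, uint64_t *x, uint64_t *y)
{
    if (a == b) {
        *x = 0;
        *y = 0;
    }
}

/* After: a single zeroed scratch register and two CMOVs
   that consume the same flags from one comparison. */
void zero_branchless(uint64_t a, uint64_t b, uint64_t *x, uint64_t *y)
{
    uint64_t zero = 0;      /* xorl %ebx,%ebx */
    int eq = (a == b);      /* cmpq %rax,%r9  */
    *x = eq ? zero : *x;    /* cmoveq %rbx,%rax */
    *y = eq ? zero : *y;    /* cmoveq %rbx,%rcx */
}
```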
An example of a CMOV(c)/CMOV(~c) pair being converted to MOV/CMOV(~c) in avl_tree.s... - before:
...
.Lj682:
cmpl %ebx,%edx
cmovll %ebx,%eax
cmovnll %edx,%eax
...
After:
...
.Lj682:
movl %ebx,%eax
cmpl %ebx,%edx
cmovnll %edx,%eax
...
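The opposing pair always overwrites the destination one way or the other, so it is really a two-way select, and the first conditional move can be demoted to an unconditional MOV hoisted above the compare. A C sketch of the equivalence (illustrative names):

```c
#include <stdint.h>

/* Before (semantically): two CMOVs with opposite conditions on the same
   destination always fully overwrite it - here this computes max(b, d). */
int32_t select_before(int32_t b, int32_t d)
{
    return (d < b) ? b : d; /* cmovll %ebx,%eax / cmovnll %edx,%eax */
}

/* After: the first CMOV becomes a plain MOV that can sit before the
   compare, leaving a single conditional move. */
int32_t select_after(int32_t b, int32_t d)
{
    int32_t result = b;     /* movl %ebx,%eax (hoisted above the cmp) */
    if (!(d < b))           /* cmovnll %edx,%eax */
        result = d;
    return result;
}
```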
In the same unit, we get to see the optimisation in OptPass1Cmp in action (intermingled with the CMOV constant optimisation) - before:
.section .text.n_avl_tree$_$tavltreenodememmanager_$__$$_setmaxfreeratio$int64,"ax"
.balign 16,0x90
.globl AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64
AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64:
movq %rcx,%rax
testq %rdx,%rdx
jnl .Lj814
xorl %edx,%edx
.Lj814:
cmpq 40(%rax),%rdx
je .Lj811
movq %rdx,40(%rax)
.Lj811:
ret
After:
.section .text.n_avl_tree$_$tavltreenodememmanager_$__$$_setmaxfreeratio$int64,"ax"
.balign 16,0x90
.globl AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64
AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64:
movq %rcx,%rax
xorl %ecx,%ecx
testq %rdx,%rdx
cmovlq %rcx,%rdx
movq %rdx,40(%rax)
ret
It's not perfect, as a stronger optimisation would be to swap %rax and %rcx (%rax isn't used in the result), although this would require rather deep analysis - i.e.:
.section .text.n_avl_tree$_$tavltreenodememmanager_$__$$_setmaxfreeratio$int64,"ax"
.balign 16,0x90
.globl AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64
AVL_TREE$_$TAVLTREENODEMEMMANAGER_$__$$_SETMAXFREERATIO$INT64:
xorl %eax,%eax
testq %rdx,%rdx
cmovlq %rax,%rdx
movq %rdx,40(%rcx)
ret
To emphasise the point about it being popular in property setters, this is the Pascal code for the above procedure:
procedure TAVLTreeNodeMemManager.SetMaxFreeRatio(NewValue: SizeInt);
begin
if NewValue<0 then NewValue:=0;
if NewValue=FMaxFreeRatio then exit;
FMaxFreeRatio:=NewValue;
end;
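The JE case that OptPass1Cmp removes is exactly the "skip the store if the value is already there" idiom above; since re-storing an equal value is harmless, the compare and branch can be dropped. A minimal C sketch (illustrative names, not from the compiler source):

```c
#include <stdint.h>

/* Before: cmp/je/mov/.lbl with identical operands in the CMP and the MOV
   (cmpq 40(%rax),%rdx ; je .Lj811 ; movq %rdx,40(%rax) ; .Lj811:). */
void setter_branchy(int64_t *field, int64_t v)
{
    if (*field != v)
        *field = v;
}

/* After: only the MOV is kept - storing a value that is already there
   leaves the memory in the same state either way. */
void setter_plain(int64_t *field, int64_t v)
{
    *field = v;
}
```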
A slightly less optimal CMOV(c)/CMOV(~c) conversion in cclasses.s - before:
...
.Lj1119:
...
movq %rax,%r13
cmpq %r14,%rax
cmovgeq %r14,%rax
cmovngeq %r13,%rax
...
After:
...
.Lj1119:
...
movq %rax,%r13
cmpq %r14,%rax
movq %r14,%rax # note: this MOV is not moved to before the CMP instruction because it would change the comparison result
cmovngeq %r13,%rax
...
This one might need analysis of the Pascal source for inefficiencies, or a closer look at the peephole optimizer, because the code still isn't optimal: %r13 = %rax at this point (and %r13 isn't used afterwards), so the optimal code would actually be:
...
.Lj1119:
...
cmpq %r14,%rax
cmovgeq %r14,%rax # strictly speaking, cmovgq %r14,%rax would be best, since it removes the extra latency of writing to %rax when it already equals %r14
...
Additional Notes
Quite a few cases of duplicated code have been noticed where the compiler checks to see whether a comparison is testing against zero. Since this happens before the post-peephole stage, it has to search for cmp $0,%reg, test %reg,%reg and test $-1,%reg. A future refactor will consolidate these individual checks into a single Boolean function.
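As far as the zero flag is concerned, the three encodings really are interchangeable; a small sketch modelling just the ZF result of each (only the zero flag is modelled here - the encodings differ in how they affect the other flags):

```c
#include <stdint.h>

/* ZF after each zero-test encoding:
   cmp $0,%reg    -> ZF set iff (reg - 0) == 0
   test %reg,%reg -> ZF set iff (reg & reg) == 0
   test $-1,%reg  -> ZF set iff (reg & -1) == 0
   All three reduce to "reg == 0". */
int zf_cmp_zero(uint32_t reg)  { return (reg - 0) == 0; }
int zf_test_self(uint32_t reg) { return (reg & reg) == 0; }
int zf_test_m1(uint32_t reg)   { return (reg & (uint32_t)-1) == 0; }
```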