[x86] Deterministic MOV optimisations
Summary
This merge request contains a number of new peephole optimisations where it can identify that a particular register value is a specific determined value and replace MOV %reg1,%reg
instructions with MOV x,%reg2
, and further do the same with arithmetic and logical instructions.
The first commit is the original CMP/JNE/MOV
optimisation that unlocked a heap of deterministic potential, and the subsequent commits capitalise on this. Further improvements can be made, but this patch has gotten bloated enough as it is.
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Speed gains are achieved by identifying registers with deterministic values across certain comparisons.
Additional notes
These optimisations tend to increase code size. Adjustments for -Os
will be added at a later date.
Relevant logs and/or screenshots
A huge number of units gain an improvement. All log examples are under x86_64-win64, -O4.
In the classes
unit, a reference rather than a register is deterministic - before:
.section .text.n_classes$_$twriter_$__$$_writeproperty$tpersistent$pointer,"ax"
...
.Lj10217:
cmpq $0,-192(%rbp)
jne .Lj10219
movq -192(%rbp),%rax
...
After (a second example appears later in the code, differing only by the offset):
.section .text.n_classes$_$twriter_$__$$_writeproperty$tpersistent$pointer,"ax"
...
.Lj10217:
cmpq $0,-192(%rbp)
jne .Lj10219
xorl %eax,%eax
...
In the cldrhelper
unit, a reference once again is deterministic, only this time it permits the removal of a subsequent LEA
instruction - before:
.section .text.n_cldrhelper$_$torderedcharacters_$__$$_insert$treorderunit$longint$$longint,"ax"
...
cmpl $0,(%rcx)
jne .Lj1285
movl (%rcx),%eax
leal 1(%eax),%edx
call CLDRHELPER$_$TORDEREDCHARACTERS_$__$$_ENSURESIZE$LONGINT
...
After:
.section .text.n_cldrhelper$_$torderedcharacters_$__$$_insert$treorderunit$longint$$longint,"ax"
...
cmpl $0,(%rcx)
jne .Lj1285
movl $1,%edx
call CLDRHELPER$_$TORDEREDCHARACTERS_$__$$_ENSURESIZE$LONGINT
...
To showcase a simple case of a logical instruction being converted into a MOV
, in the aasmcnst
unit - before:
...
.Lj303:
movb $1,%bl
testw $256,72(%rbp)
je .Lj305
orb $2,%bl
.Lj305:
...
After - %bl
is guaranteed to be equal to 1 when the OR
instruction is reached, so its operand is merged with %bl
's current value and converted to a MOV
. Changing OR
to MOV
removes the dependency on %bl
's previous value, potentially allowing a speed improvement even though it's nebulous in this case:
...
.Lj303:
movb $1,%bl
testw $256,72(%rbp)
je .Lj305
movb $3,%bl
.Lj305:
...
In the fpide
unit, an arithmetic instruction gets converted into a MOV
that then cascades down - before:
.section .text.n_fpide$_$tideapp_$__$$_memorysizes,"ax"
...
movl $21,%eax
movl $21,72(%rsp)
addl $10,%eax
movl %eax,80(%rsp)
...
After (in this case, there's no comparison to capitalise on... it does it independently):
.section .text.n_fpide$_$tideapp_$__$$_memorysizes,"ax"
...
movl $21,72(%rsp)
movl $31,80(%rsp)
...
In the gnutlssockets
unit, a SUB
gets optimised this time - before:
.section .text.n_gnutlssockets$_$tgnutlsx509certificate_$__$$_gensrvcert$hwdhavlhbo5k,"ax"
...
.Lj213:
movl $8192,%eax
movq $8192,-112(%rbp)
subq $1,%rax
movq %rax,-136(%rbp)
...
After (in thia case %eax
is not used afterwards, so its MOV
can be removed completely):
.section .text.n_gnutlssockets$_$tgnutlsx509certificate_$__$$_gensrvcert$hwdhavlhbo5k,"ax"
...
.Lj213:
movq $8192,-112(%rbp)
movq $8191,-136(%rbp)
...
In the assemble
unit, what was originally an AND
instruction that zeroes a single bit (converted to BTR
by the post-peephole stage) - before:
.globl ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT
ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT:
...
movl $268435455,%esi
cmpb $0,U_$GLOBALS_$$_USEDEFFILEFOREXPORTS(%rip)
je .Lj932
btrl $10,%esi
.Lj932:
... it gets converted into a MOV
, which is then optimised by the CMOVcc
optimisations to remove the conditional branch - after:
.globl ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT
ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT:
...
movl $268435455,%esi
movl $268434431,%eax
cmpb $0,U_$GLOBALS_$$_USEDEFFILEFOREXPORTS(%rip)
cmovnel %eax,%esi
A similar example occurs in the bufdataset
unit, albeit with an OR
instruction - before:
.section .text.n_bufdataset$_$tdatapackethandler_$__$$_bytetorowstate$byte$$trowstate,"ax"
.balign 16,0x90
.globl BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE
BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE:
.Lc1303:
xorl %eax,%eax
testb $1,%dl
je .Lj2001
orl $1,%eax
.Lj2001:
...
After (this whole sequence has a dependency chain length of only 2):
.section .text.n_bufdataset$_$tdatapackethandler_$__$$_bytetorowstate$byte$$trowstate,"ax"
.balign 16,0x90
.globl BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE
BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE:
.Lc1303:
xorl %eax,%eax
movl $1,%r8d
testb $1,%dl
cmovnel %r8d,%eax
...
In the cgobj
unit, the new SETcc
commit finds action outside of the initial CMP/JNE/MOV
optimisation - before:
.section .text.n_cgobj_$$_asmsym2indsymflags$tasmsymbol$$tindsymflags,"ax"
.balign 16,0x90
.globl CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS
CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS:
.Lc635:
xorb %al,%al
cmpb $1,45(%rcx)
je .Lj1020
orb $1,%al
.Lj1020:
cmpb $5,44(%rcx)
jne .Lj1022
orb $2,%al
.Lj1022:
.Lc636:
ret
.Lc634:
After:
.section .text.n_cgobj_$$_asmsym2indsymflags$tasmsymbol$$tindsymflags,"ax"
.balign 16,0x90
.globl CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS
CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS:
.Lc635:
cmpb $1,45(%rcx)
setneb %al
cmpb $5,44(%rcx)
jne .Lj1022
orb $2,%al
.Lj1022:
.Lc636:
ret
.Lc634:
In the fpwavreader
unit, a deterministic MOV
leads to a LEA
and a NOT
instruction being simplified in the same block - before:
.section .text.n_fpwavreader$_$twavreader_$__$$_loadfromstream$tstream$$boolean,"ax"
...
movl $1,%edx
leaq -16(%rax,%rdx),%rax
movl $1,%ecx
andq $2,%rcx
jne .Lj50
notq %rdx
andq %rax,%rdx
jmp .Lj51
.p2align 4,,10
.p2align 3
.Lj50:
...
After (a future merge request will extend this optimisation to simplify the xorl %ecx,%ecx
/testq %rcx,%rcx
pair, since %ecx
is definitely zero and the conditional jump will definitely not branch - in fact, %ecx
is not used after this point because the first instruction after .Lj51
is a write to %rcx
that doesn't depend on its previous value: movq 16(%rbx),%rcx
):
.section .text.n_fpwavreader$_$twavreader_$__$$_loadfromstream$tstream$$boolean,"ax"
...
movl $1,%edx
subq $15,%rax
xorl %ecx,%ecx
testq %rcx,%rcx
jne .Lj50
movq $-2,%rdx
andq %rax,%rdx
jmp .Lj51
.p2align 4,,10
.p2align 3
.Lj50:
...
In the mdt
unit, there are a couple of instances where a reference is deterministic, but one is worth more attention - before:
.section .text.n_mdt_$$_mdtgtr$hgf19mgl3f$h,"ax"
...
.Lj94:
cmpl %eax,-48(%rbp)
jne .Lj93
movl -48(%rbp),%eax
movq -32(%rbp),%rdx
comisd -8(%rdx,%rax,8),%xmm6
...
After:
.section .text.n_mdt_$$_mdtgtr$hgf19mgl3f$h,"ax"
...
.Lj94:
cmpl %eax,-48(%rbp)
jne .Lj93
andl %eax,%eax
movq -32(%rbp),%rdx
comisd -8(%rdx,%rax,8),%xmm6
...
The compiler doesn't have enough information to determine if the upper bits of %rax
are already zero, so the AND
instruction cannot be safely omitted. This would require some ability for the compiler to make arbitrary notes about the states of registers.