[x86] Advanced register simplification optimisation
Summary
This merge request removes the temporary storage registers that are often inserted by the code generator (it inserts them because, at that stage, it doesn't yet know whether the sources and destinations will end up in registers, on the stack or elsewhere in memory). The optimisation looks for constructs such as the following:
movl %reg1,%reg2
(Instructions that modify %reg2)
movl %reg2,%reg1
(%reg2 deallocated)
and changes it to:
(Instructions that modify %reg1)
This goes further than a simple peephole optimization in that it will actually deallocate %reg2 over this block so it can be reused.
This is not limited to `mov %reg1,%reg2 ... mov %reg2,%reg1`. In many situations, `mov (ref),%reg2 ... mov %reg2,%reg1` and `mov %reg3,%reg2 ... mov %reg2,%reg1` can be simplified in the same way (although more care is taken in these cases).
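The core idea can be sketched as follows. This is a hypothetical Python model, not the actual pass: the real implementation must also consult the register allocator's deallocation markers, flag effects and partial-register accesses before rewriting anything.

```python
# Hypothetical sketch of the simplification: find "mov r1,r2 ... mov r2,r1",
# rewrite the intervening instructions to use r1 directly, and drop both MOVs.
# Instructions are modelled as (opcode, source, destination) 3-tuples.

def simplify(block):
    out = list(block)
    i = 0
    while i < len(out):
        op, src, dst = out[i]
        if op != "mov":
            i += 1
            continue
        r1, r2 = src, dst
        # Scan forward for the matching "mov r2,r1"; abort if r1 is
        # read or written in between (the real pass is less conservative,
        # and also checks that r2 is deallocated afterwards).
        j = i + 1
        ok = True
        while j < len(out):
            op2, src2, dst2 = out[j]
            if op2 == "mov" and src2 == r2 and dst2 == r1:
                break
            if src2 == r1 or dst2 == r1:
                ok = False
                break
            j += 1
        else:
            ok = False          # reached the end without finding the pair
        if ok:
            # Substitute r2 -> r1 in the intervening instructions.
            for k in range(i + 1, j):
                opk, srck, dstk = out[k]
                out[k] = (opk,
                          r1 if srck == r2 else srck,
                          r1 if dstk == r2 else dstk)
            del out[j], out[i]  # drop both MOVs (higher index first)
            continue
        i += 1
    return out

# The crc example from below, in miniature:
block = [("mov", "%ecx", "%r10d"),
         ("shr", "$8", "%r10d"),
         ("xor", "(%r11,%r9,4)", "%r10d"),
         ("mov", "%r10d", "%ecx")]
assert simplify(block) == [("shr", "$8", "%ecx"),
                           ("xor", "(%r11,%r9,4)", "%ecx")]
```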
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Instructions are removed as the peephole optimizer attempts to eliminate temporary storage registers.
Additional notes
Some fixes to `RegInInstruction` and `RegModifiedByInstruction` were required in order to handle `(I)MUL` and `(I)DIV` instructions more accurately.
`ReplaceRegisterInInstruction` was updated with a new parameter to permit changes to write and modification operands, not just read operands (the simplification algorithm requires this).
This optimisation is only performed under -O3 and currently incurs a large compile-time penalty. This is because of the frequent looking ahead and backwards to confirm the state of registers, both to find more optimisations and to ensure the registers being simplified won't be affected by partial register reads/writes. This may be deemed unacceptable, but if and when !74 is approved in some form, I plan to use compiler hints to indicate the used size of registers (since a lot of peephole optimisations allocate the full register size, and allocation markers may not be nearby). Otherwise, I aim to refactor this optimisation to be faster later on.
Relevant logs and/or screenshots
A lot of the time, the end result is just `MOV` instructions that have been rearranged, but a very large number of files receive improvements of some form.
(All examples are taken under x86_64-win64, -O4)
A common improvement that appears across a lot of object-oriented units (advancedipc is used as an example here) - before:
.globl ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT
ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT:
...
movq -24(%rbp),%rax
movq %rax,%rdx
movq %rax,%rcx
call *104(%rdx)
...
After:
.globl ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT
ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT:
...
movq -24(%rbp),%rcx
movq %rcx,%rdx
call *104(%rcx)
...
This construct appears very often.
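A quick equivalence check of this rewrite (a hypothetical mini-interpreter, not compiler code): both sequences must leave the same values in %rcx and %rdx, and %rax is dead after the call site, so the rewritten version may drop it entirely.

```python
# Model each step as dst := regs[src] (or an immediate when src isn't a
# register). The constant stands in for the load from -24(%rbp).

def run(moves):
    regs = {}
    for dst, src in moves:
        regs[dst] = regs.get(src, src)
    return regs

MEM = 0xDEADBEEF
before = run([("rax", MEM), ("rdx", "rax"), ("rcx", "rax")])
after  = run([("rcx", MEM), ("rdx", "rcx")])

# %rcx and %rdx agree; only the dead %rax differs.
assert (before["rcx"], before["rdx"]) == (after["rcx"], after["rdx"])
```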
The crc unit showcases `mov %reg1,%reg2 ... mov %reg2,%reg1` quite nicely - before:
.Lj12:
...
movl %ecx,%r10d
shrl $8,%r10d
leaq TC_$CRC_$$_CRC32_TABLE(%rip),%r11
xorl (%r11,%r9,4),%r10d
movl %r10d,%ecx
...
After:
.Lj12:
...
shrl $8,%ecx
leaq TC_$CRC_$$_CRC32_TABLE(%rip),%r11
xorl (%r11,%r9,4),%ecx
...
An example in constexp that showcases `mov (ref),%reg2 ... mov %reg2,%reg1` - before:
.globl CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT
CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT:
...
movq 8(%rdx),%rax
negq %rax
movq $9223372036854775807,%r11
addq %r11,%rax
movq %rax,%r9
After:
.globl CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT
CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT:
...
movq 8(%rdx),%r9
negq %r9
movq $9223372036854775807,%r11
addq %r11,%r9
In aasmcpu, the register simplification permits a later optimisation to create an `ADD` instruction instead of a larger `LEA` instruction - before:
.Lj2182:
...
movq U_$AASMCPU_$$_INSTABMEMREFSIZEINFOCACHE(%rip),%rax
movzwl %r14w,%edx
shlq $5,%rdx
leaq (%rax,%rdx),%rcx
...
After:
.Lj2182:
...
movq U_$AASMCPU_$$_INSTABMEMREFSIZEINFOCACHE(%rip),%rcx
movzwl %r14w,%edx
shlq $5,%rdx
addq %rdx,%rcx
...
Also in aasmcpu, three similar quartets get simplified - before:
...
.Lj2276:
movq -648(%rbp),%rax
orq -616(%rbp),%rax
movq %rax,%rcx
cmpq $536870912,%rax ; <- This was changed from %rcx to %rax by DeepMOVOpt to minimise a pipeline stall
...
After:
...
.Lj2276:
movq -648(%rbp),%rcx
orq -616(%rbp),%rcx
cmpq $536870912,%rcx ; <- %rax has been removed completely from this region
...
In aasmcpu again, one optimisation looks suspect at first, but after closer analysis, the final values of %eax and %r15d are the same - before:
...
.Lj478:
movl 112(%rbp),%eax
subl %r14d,%eax
leal 2(%eax),%r15d
addl $2,%eax
...
After:
...
.Lj478:
movl 112(%rbp),%r15d
subl %r14d,%r15d
leal 2(%r15d),%eax
addl $2,%r15d
...
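The "suspect" rewrite can be confirmed with a little 32-bit arithmetic (a hypothetical check; everything is masked to 32 bits, as the `l`-suffixed instructions do): both orderings leave identical final values in %eax and %r15d.

```python
M32 = 0xFFFFFFFF

def before(mem, r14):
    eax = (mem - r14) & M32     # movl 112(%rbp),%eax / subl %r14d,%eax
    r15 = (eax + 2) & M32       # leal 2(%eax),%r15d
    eax = (eax + 2) & M32       # addl $2,%eax
    return eax, r15

def after(mem, r14):
    r15 = (mem - r14) & M32     # movl 112(%rbp),%r15d / subl %r14d,%r15d
    eax = (r15 + 2) & M32       # leal 2(%r15d),%eax
    r15 = (r15 + 2) & M32       # addl $2,%r15d
    return eax, r15

for mem, r14 in [(100, 7), (0, 5), (M32, 1)]:
    assert before(mem, r14) == after(mem, r14)
```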
The system unit has a couple of cute ones - before:
.Lj256:
...
movq %r9,%r11
subq %rcx,%r11
movq %r11,%rdx
sarq $63,%rdx
andq $7,%rdx
addq %rdx,%r11
sarq $3,%r11
movq %r11,%rax
...
After:
.Lj256:
...
movq %r9,%rax
subq %rcx,%rax
cqto
andq $7,%rdx
addq %rdx,%rax
sarq $3,%rax
...
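The rewritten sequence is the standard signed-division-by-8 idiom, with `cqto` producing the sign mask in %rdx once the value lives in %rax (replacing the `mov`/`sarq $63` pair). A hypothetical check of the underlying arithmetic:

```python
# Add 7 to negative values before the arithmetic shift so the result
# truncates toward zero, matching signed integer division.

def sdiv8(x):
    sign = -1 if x < 0 else 0   # sarq $63 (or cqto into %rdx)
    bias = sign & 7             # andq $7,%rdx
    return (x + bias) >> 3      # addq %rdx,... / sarq $3,...

for x in (-17, -8, -1, 0, 1, 8, 17):
    assert sdiv8(x) == int(x / 8)   # truncation toward zero
```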
The removal of a `TEST` instruction (thanks to coupling with other peephole optimizations that permit the `SUB` instruction to take on its role) - before:
.Lj273:
movq (%rcx),%r11
subq (%rdx),%r11
movq %r11,%r10
testq %r11,%r11
je .Lj277
...
After:
.Lj273:
movq (%rcx),%r10
subq (%rdx),%r10
je .Lj277
...
And again, this time with a `SHR` instruction (this one is clearer with `DEBUG_AOPTCPU`, since `MOV` instructions get removed and DeepMOVOpt changes registers) - before:
.Lj2064:
...
# Peephole Optimization: Mov2Nop 3b done
movq $-3689348814741910323,%rdx
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
mulx %rcx,%rax,%rax
shrq $3,%rax
movq %rax,%rcx
# Peephole Optimization: %rcx = %rax; changed to minimise pipeline stall (MovXXX2MovXXX)
testq %rax,%rax
jne .Lj2064
...
After:
.Lj2064:
...
# Peephole Optimization: Simplification start, replacing %rax with %rcx
# Peephole Optimization: Mov2Nop 1a done
movq $-3689348814741910323,%rdx
# Peephole Optimization: Simplified register usage, replacing %rax with %rcx
mulx %rcx,%rcx,%rcx
# Peephole Optimization: Simplified register usage, replacing %rax with %rcx
shrq $3,%rcx
# Peephole Optimization: Simplification endpoint, replacing %rax with %rcx (MovMov2Mov 4)
jne .Lj2064
...
The adler unit under `-CpCOREAVX2` shows improvement (and this was the original example that I sought to optimise) - before:
.Lj14:
...
movl %r10d,%ebx
movq $281539415969072,%rdx
mulx %rbx,%rdx,%rdx
imull $65521,%edx
subl %edx,%ebx
movl %ebx,%r10d
movl %ecx,%ebx
movq $281539415969072,%rdx
mulx %rbx,%rdx,%rdx
imull $65521,%edx
subl %edx,%ebx
movl %ebx,%ecx
...
After:
.Lj14:
...
andl %r10d,%r10d
movq $281539415969072,%rdx
mulx %r10,%rdx,%rdx
imull $65521,%edx
subl %edx,%r10d
andl %ecx,%ecx
movq $281539415969072,%rdx
mulx %rcx,%rdx,%rdx
imull $65521,%edx
subl %edx,%ecx
...
%ebx gets removed. Unfortunately, the compiler cannot determine whether the upper halves of %r10 and %rcx are zero without additional hints, so the `AND` instructions are necessary to ensure stale upper bits don't affect the `MULX` instructions.
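To see why the `AND`s matter: on x86-64, any write to a 32-bit register (here, `andl %r10d,%r10d`) zero-extends into the upper half, while `mulx` reads the full 64-bit source. If the upper 32 bits held garbage, the high half of the product would change. A hypothetical arithmetic check:

```python
CONST = 281539415969072         # the multiplier from the adler sequence

def mulx_hi(src):
    """High 64 bits of CONST * src, as mulx writes to its destinations."""
    return (CONST * src) >> 64

dirty = (0xABCD << 32) | 12345  # garbage in the upper 32 bits
clean = dirty & 0xFFFFFFFF      # effect of the 32-bit register write

# The garbage changes the high half of the product, so it must be cleared.
assert mulx_hi(dirty) != mulx_hi(clean)
```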
Future development
Some additional peephole optimisations may be required to clean up inefficiencies that sometimes appear, such as adding `SAR`, `SHR` and `SHL` as acceptable instructions in the `AndOp2Op` optimisation in order to help remove `and %reg,%reg`.
In the Adler example, because %ebx has now been deallocated, a peephole optimization could attempt to change the write destination from %rdx to %rbx (and its eventual merge into %r10d and %rcx), permitting the second `movq $281539415969072,%rdx` to be removed.
This style of optimisation is very fundamental and could easily be ported in some form to AArch64, among other platforms.