[x86] Extensions to FuncMov2Func optimisation
Summary
This merge request contains three commits related to extending the FuncMov2Func optimisation:
- FuncMov2Func now works for 1, 3 and 4-operand instructions where the last operand is a pure write and the others are all pure reads.
- FuncMov2Func has been moved to a separate procedure so it can also be called in Pass 2, since other Pass 2 optimisations can open up new opportunities.
- Because it is now called in Pass 2, the FuncMov2Func optimisation checks whether the end result is mov %reg,%reg, which can happen if the function is another MOV instruction, since this would otherwise be missed in Pass 2.
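The behaviour listed above can be illustrated with a small Python model. This is only a sketch, not the compiler's Pascal implementation: the names Instr, regs, func_mov_2_func, PLAIN_MOVS and PURE_WRITE_PREFIXES are hypothetical, the live_after set stands in for the compiler's register tracking, and operand sizes and sub-register aliasing (e.g. %dl versus %rdx) are ignored.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple

PLAIN_MOVS = {"movb", "movw", "movl", "movq"}
# Illustrative assumption: mnemonics whose last operand is a pure write.
PURE_WRITE_PREFIXES = ("mov", "lea", "andn", "set")


@dataclass(frozen=True)
class Instr:
    op: str                    # mnemonic, e.g. "movb" or "leaq"
    operands: Tuple[str, ...]  # AT&T order: sources first, destination last


def regs(operand: str) -> Set[str]:
    """Registers named in an operand, including inside memory references."""
    return set(re.findall(r"%\w+", operand))


def is_reg(operand: str) -> bool:
    return re.fullmatch(r"%\w+", operand) is not None


def func_mov_2_func(code: List[Instr], live_after: Set[str]) -> List[Instr]:
    """Fold 'func ...,%tmp ; mov %tmp,%dst' into 'func ...,%dst'.

    live_after is a hypothetical stand-in for register tracking: the
    registers that are still needed once this sequence has finished.
    """
    out = list(code)
    i = 0
    while i + 1 < len(out):
        func, mov = out[i], out[i + 1]
        tmp = func.operands[-1] if func.operands else ""
        later = out[i + 2:]
        foldable = (
            func.op.startswith(PURE_WRITE_PREFIXES)
            and is_reg(tmp)
            and mov.op in PLAIN_MOVS
            and mov.operands[0] == tmp          # MOV reads what func wrote
            and is_reg(mov.operands[1])
            and tmp not in live_after           # %tmp is dead afterwards
            and all(tmp not in regs(o) for ins in later for o in ins.operands)
        )
        if not foldable:
            i += 1
            continue
        # Redirect the function's destination and drop the MOV.
        folded = Instr(func.op, func.operands[:-1] + (mov.operands[1],))
        out[i:i + 2] = [folded]
        # If the function was itself a MOV, the result may be mov %reg,%reg.
        if folded.op in PLAIN_MOVS and folded.operands[0] == folded.operands[1]:
            del out[i]
            i = max(i - 1, 0)
        # Otherwise stay on the same index so a chain of MOVs keeps folding.
    return out


chain = [
    Instr("movb", ("(%r8,%rax,1)", "%dl")),
    Instr("movb", ("%dl", "%cl")),
    Instr("movb", ("%cl", "%bl")),
    Instr("jmp", (".Lj836",)),
]
result = func_mov_2_func(chain, live_after={"%bl"})
```

In this model the two trailing MOVs in chain are folded one after the other, which mirrors the defutils example shown under "Relevant logs and/or screenshots". The liveness test is deliberately conservative: any later mention of the intermediate register blocks the fold.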
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
FuncMov2Func now optimises 1, 3 and 4-operand instructions that are immediately followed by a MOV instruction.
Relevant logs and/or screenshots
Under x86_64-win64, -O4, in the defutils unit, FuncMov2Func gets called twice in a row to collapse a dependency chain - before:
...
.Lj925:
leaq TC_$CGBASE$_$INT_CGSIZE$INT64$$TCGSIZE_$$_SIZE2CGSIZE(%rip),%r8
movb (%r8,%rax,1),%dl
movb %dl,%cl
movb %cl,%bl
jmp .Lj836
...
After:
...
.Lj925:
leaq TC_$CGBASE$_$INT_CGSIZE$INT64$$TCGSIZE_$$_SIZE2CGSIZE(%rip),%r8
movb (%r8,%rax,1),%bl
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
jmp .Lj836
...
In the System unit - before:
.section .text.n_system_$$_align$qword$qword$$qword,"ax"
.balign 16,0x90
.globl SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD
SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD:
.Lc162:
movq %rdx,%r8
leaq -1(%rdx),%rax
leaq (%rcx,%rax),%r9
andq %rax,%rdx
jne .Lj191
andn %r9,%rax,%rax
movq %rax,%rcx
# Peephole Optimization: Duplicated 1 assignment(s) and redirected jump
# Peephole Optimization: %rcx = %rax; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: Mov2Nop 4 done
ret
.p2align 4,,10
.p2align 3
.Lj191:
movq %r9,%rax
xorl %edx,%edx
divq %r8
subq %rdx,%r9
movq %r9,%rcx
movq %rcx,%rax
.Lc163:
ret
After:
.section .text.n_system_$$_align$qword$qword$$qword,"ax"
.balign 16,0x90
.globl SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD
SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD:
.Lc162:
movq %rdx,%r8
leaq -1(%rdx),%rax
leaq (%rcx,%rax),%r9
andq %rax,%rdx
jne .Lj191
andn %r9,%rax,%rcx
movq %rcx,%rax ; <-- Registers have been swapped, but this is not yet optimised into just "andn %r9,%rax,%rax" because %rcx is still tracked as 'in use'. A future merge request aims to correct this.
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
# Peephole Optimization: Duplicated 1 assignment(s) and redirected jump
ret
.p2align 4,,10
.p2align 3
.Lj191:
movq %r9,%rax
xorl %edx,%edx
divq %r8
subq %rdx,%r9
movq %r9,%rax
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
.Lc163:
ret
Additional notes
Sometimes, FuncMov2Func now does the work of SETcc/MOV -> SETcc, although there are still cases where the latter is performed instead.
Some new inefficiencies were identified thanks to this improved optimisation - for example, in the SysUtils unit:
.Lj7599:
...
movb %sil,%dil
movb -4(%rbp),%dil
movb -4(%rbp),%sil
That first MOV instruction could be removed.
In unicodedata - before (unrelated debug messages removed for clarity):
...
.Lj964:
movzwl (%r12),%ecx
leaq (%r12,%rcx),%rdx
movq %rdx,%r12
jmp .Lj957
.balign 16,0x90
...
After:
...
.Lj964:
movzwl (%r12),%ecx
leaq (%r12,%rcx),%r12
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
jmp .Lj957
.balign 16,0x90
...
In this situation, another run of Pass 2 would convert the LEA instruction into add %rcx,%r12.
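The LEA-to-ADD rewrite mentioned here can be sketched in Python as follows. This is a hypothetical illustration rather than the compiler's code: lea_to_add and its flags_live parameter are assumed names, and only the simple base-plus-index form with no displacement or scale is handled. The flags check is there because ADD modifies the flags register while LEA does not.

```python
import re
from typing import Optional, Tuple

# Matches a memory operand of the form (%base,%index) only.
LEA_BASE_INDEX = re.compile(r"\((%\w+),(%\w+)\)")


def lea_to_add(op: str, src: str, dst: str,
               flags_live: bool) -> Optional[Tuple[str, str, str]]:
    """Return (mnemonic, source, destination) for the ADD, or None."""
    match = LEA_BASE_INDEX.fullmatch(src)
    if op not in ("leal", "leaq") or match is None or flags_live:
        return None
    base, index = match.groups()
    if dst == base:
        return ("add" + op[-1], index, dst)
    if dst == index:
        return ("add" + op[-1], base, dst)
    return None  # destination differs from both inputs; keep the LEA


converted = lea_to_add("leaq", "(%r12,%rcx)", "%r12", flags_live=False)
```

With the unicodedata example above, converted holds the ADD form of the instruction. The rewrite only becomes possible after FuncMov2Func has made the LEA's destination equal to one of its inputs.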