[x86] Fixed inaccuracy in the long-range MOV optimisations
Summary
This merge request fixes a minor inconsistency in some of the long-range deep optimisations performed by OptPass1MOV. Previously, if an operand of a distant MOV instruction was changed to minimise a pipeline stall (even when the instruction is too far away for this to have any real benefit), the process would drop out prematurely and miss other potential optimisations, such as simplifying conditional jumps and removing redundant assignments.
A second commit moves the MovMov2MovMov 2 optimisation (changes "mov [ref],reg1; mov [ref],reg2" into "mov [ref],reg1; mov reg1,reg2") into a separate routine so it can be called during the long-range deep optimisations. This catches a one-off inefficiency: if the reference in the first MOV of the pair contains the source register and reg1 is the target register, the end result is less optimal code.
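The pattern behind MovMov2MovMov 2 can be sketched as a toy peephole pass. This is only an illustration in Python, not the compiler's Pascal implementation: the function name `mov_mov_to_mov_mov`, the `(op, src, dst)` tuple layout (AT&T operand order) and the restriction to `movq` with 64-bit register names are all assumptions made for the example.

```python
import re

REG = re.compile(r"%[a-z0-9]+")

def is_reg(operand):
    return REG.fullmatch(operand) is not None

def is_mem(operand):
    # Toy rule: only parenthesised references such as -8(%rbx) count as memory.
    return "(" in operand

def mov_mov_to_mov_mov(instrs):
    """Rewrite 'movq ref,reg1; movq ref,reg2' into 'movq ref,reg1; movq reg1,reg2'.

    instrs is a list of hypothetical (op, src, dst) tuples. Only adjacent
    movq pairs are considered, so sub-register aliasing is not modelled.
    """
    out = list(instrs)
    for i in range(len(out) - 1):
        op1, src1, dst1 = out[i]
        op2, src2, dst2 = out[i + 1]
        if (op1 == "movq" and op2 == "movq"
                and is_mem(src1) and src1 == src2
                and is_reg(dst1) and is_reg(dst2)
                # If reg1 is part of the reference, the first load changes
                # the address, so the second load is not the same value.
                and dst1 not in REG.findall(src1)):
            out[i + 1] = (op2, dst1, dst2)
    return out

pair = [("movq", "-8(%rbx)", "%rcx"), ("movq", "-8(%rbx)", "%rax")]
print(mov_mov_to_mov_mov(pair))
```

The guard on `dst1` matters for pairs such as `movq -8(%rcx),%rcx; movq -8(%rcx),%rax`, where the textually identical reference reads a different address after the first load; the sketch leaves such pairs untouched.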
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
Some potential optimisations under -O3 get missed.
What is the behavior after applying this patch?
More code gets successfully optimised.
Relevant logs and/or screenshots
Most units change, but the differences are largely just a change of register to minimise the chance of a pipeline stall. In some cases, though, the change permits another optimisation to take place.
Under x86_64-win64 -O4 in the ServiceManager unit - before:
...
.section .text.n_servicemanager_$$_allocdependencylist$ansistring$$pchar,"ax"
...
.Lc120:
.seh_stackalloc 40
.seh_endprologue
movq %rcx,%rbx
movq $0,32(%rsp)
testq %rcx,%rcx
je .Lj154
movq %rcx,%rsi
testq %rbx,%rbx
je .Lj155
movq -8(%rbx),%rsi
.Lj155:
movslq %esi,%rax
...
With the inefficiency fixed, the second TEST instruction gets optimised, and this causes a cascade that removes the conditional jump, since its condition is guaranteed to be false (if %rcx = 0, the control flow will already have jumped to .Lj154) - after:
...
.section .text.n_servicemanager_$$_allocdependencylist$ansistring$$pchar,"ax"
...
.Lc120:
.seh_stackalloc 40
.seh_endprologue
movq %rcx,%rbx
movq $0,32(%rsp)
testq %rcx,%rcx
je .Lj154
movq -8(%rcx),%rsi
movslq %esi,%rax
...
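The cascade in the ServiceManager listing can be modelled with a small forward pass. This is a hedged sketch in Python rather than the compiler's actual code: the function `simplify`, the tuple encoding of instructions and the `"label"` pseudo-op are hypothetical, only `movq`, `testq reg,reg` + `je` and labels are modelled, and it assumes the flags are not read again after a dropped `je`. Unlike the real output, it does not remove the now-dead `movq %rcx,%rsi` or the unused label.

```python
import re

REG = re.compile(r"%[a-z0-9]+")

def simplify(instrs):
    """Toy forward pass over (op, operand, ...) tuples in AT&T operand order."""
    alias = {}       # copy register -> register it is still known to equal
    nonzero = set()  # registers known to be non-zero on the current path
    out = []

    def canon(operand):
        # Replace every register in the operand by the register it copies.
        return REG.sub(lambda m: alias.get(m.group(0), m.group(0)), operand)

    i = 0
    while i < len(instrs):
        op, *args = instrs[i]
        if op == "movq":
            src, dst = canon(args[0]), args[1]
            if REG.fullmatch(dst):
                if src == dst:
                    i += 1  # both registers already hold the same value
                    continue
                # A register write invalidates what was known about it.
                alias = {c: o for c, o in alias.items() if dst not in (c, o)}
                nonzero.discard(dst)
                if REG.fullmatch(src):
                    alias[dst] = src
            else:
                dst = canon(dst)  # registers inside a memory operand are reads
            out.append((op, src, dst))
        elif (op == "testq" and REG.fullmatch(args[0]) and args[0] == args[1]
              and i + 1 < len(instrs) and instrs[i + 1][0] == "je"):
            reg = canon(args[0])
            if reg in nonzero:
                i += 2  # the jump can never be taken: drop TEST and JE
                continue
            out.append((op, reg, reg))
            out.append(instrs[i + 1])
            nonzero.add(reg)  # falling through the JE means reg != 0
            i += 1
        elif op == "je":
            out.append(instrs[i])
        else:
            # Labels and anything unmodelled: forget everything, keep the line.
            alias.clear()
            nonzero.clear()
            out.append(instrs[i])
        i += 1
    return out

before = [
    ("movq", "%rcx", "%rbx"),
    ("movq", "$0", "32(%rsp)"),
    ("testq", "%rcx", "%rcx"),
    ("je", ".Lj154"),
    ("movq", "%rcx", "%rsi"),
    ("testq", "%rbx", "%rbx"),
    ("je", ".Lj155"),
    ("movq", "-8(%rbx),", "%rsi")[:1] + ("-8(%rbx)", "%rsi"),
    ("label", ".Lj155"),
]
for ins in simplify(before):
    print(*ins)
```

Reading `%rbx` through its copy turns the second `testq` into a test of `%rcx`, which is already known to be non-zero on the fall-through path, so the `testq`/`je` pair disappears and the load becomes `movq -8(%rcx),%rsi`. The same copy tracking drops a `movq %rbx,%rcx` that follows `movq %rcx,%rbx` when neither register has been written in between.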
The system unit has a couple of situations where redundant writes are removed - before:
.section .text.n_fpc_widestr_compare,"ax"
...
movq %rcx,%rbx
movq %rdx,%rsi
...
.Lj3275:
movq %rcx,%r8
testq %rbx,%rbx
je .Lj3276
movl -4(%rbx),%r8d
shrq $1,%r8
.Lj3276:
movq %rdx,%rax
testq %rsi,%rsi
je .Lj3277
movl -4(%rsi),%eax
shrq $1,%rax
.Lj3277:
cmpq %r8,%rax
cmovlq %rax,%r8
movq %rsi,%rdx
movq %rbx,%rcx
call SYSTEM_$$_COMPAREWORD$formal$formal$INT64$$INT64
...
In this case, besides the change of register from %rbx to %rcx (and from %rsi to %rdx) in many instructions, the two MOV instructions immediately prior to the CALL get removed since the registers already contain the same values - after:
.section .text.n_fpc_widestr_compare,"ax"
...
movq %rcx,%rbx
movq %rdx,%rsi
...
.Lj3275:
movq %rcx,%r8
testq %rcx,%rcx
je .Lj3276
movl -4(%rcx),%r8d
shrq $1,%r8
.Lj3276:
movq %rdx,%rax
testq %rdx,%rdx
je .Lj3277
movl -4(%rdx),%eax
shrq $1,%rax
.Lj3277:
cmpq %r8,%rax
cmovlq %rax,%r8
call SYSTEM_$$_COMPAREWORD$formal$formal$INT64$$INT64
...
The Strutils unit receives an optimisation similar to the one in the ServiceManager unit - before:
.section .text.n_strutils_$$_ansistartsstr$ansistring$ansistring$$boolean,"ax"
...
movq %rcx,%rbx
movq $0,-8(%rbp)
testq %rcx,%rcx
je .Lj459
movq %rcx,%r8
testq %rbx,%rbx
je .Lj462
movq -8(%rbx),%r8
.Lj462:
leaq -8(%rbp),%rcx
After:
.section .text.n_strutils_$$_ansistartsstr$ansistring$ansistring$$boolean,"ax"
...
movq %rcx,%rbx
movq $0,-8(%rbp)
testq %rcx,%rcx
je .Lj459
movq -8(%rcx),%r8
leaq -8(%rbp),%rcx
...
A simple example in the classes unit - before:
.seh_proc CLASSES$_$TLIST_$__$$_NOTIFY$POINTER$TLISTNOTIFICATION
...
movq %rcx,%rbx
...
.Lj2467:
movq %rdx,%r9
movl $2,%r8d
movq %rcx,%rdx
movq %rbx,%rcx
call CLASSES$_$TLIST_$__$$_FPONOTIFYOBSERVERS$TOBJECT$TFPOBSERVEDOPERATION$POINTER
...
The redundant copy of %rbx back into %rcx gets removed - after:
.seh_proc CLASSES$_$TLIST_$__$$_NOTIFY$POINTER$TLISTNOTIFICATION
...
movq %rcx,%rbx
...
.Lj2467:
movq %rdx,%r9
movl $2,%r8d
movq %rcx,%rdx
call CLASSES$_$TLIST_$__$$_FPONOTIFYOBSERVERS$TOBJECT$TFPOBSERVEDOPERATION$POINTER
...
The second commit alone is able to improve the final assembly language, for example in bin2obj - before:
...
.section .text.n_p$bin2obj$_$writememstream_$$_writestrln$shortstring,"ax"
...
movq %rcx,%rbx
movzbl (%rdx),%r8d
addq $1,%rdx
movq -8(%rcx),%rcx
movq -8(%rbx),%rax
movq (%rax),%rax
call *264(%rax)
...
.section .text.n_p$bin2obj$_$writememstream_$$_writestr$shortstring,"ax"
...
movq %rcx,%rax
movzbl (%rdx),%r8d
addq $1,%rdx
movq -8(%rcx),%rcx
movq -8(%rax),%rax
movq (%rax),%rax
call *264(%rax)
...
With MovMov2MovMov 2 being performed as soon as possible, the instructions are made more efficient - after:
...
.section .text.n_p$bin2obj$_$writememstream_$$_writestrln$shortstring,"ax"
...
movq %rcx,%rbx
movzbl (%rdx),%r8d
addq $1,%rdx
movq -8(%rcx),%rcx
movq (%rcx),%rax
call *264(%rax)
...
.section .text.n_p$bin2obj$_$writememstream_$$_writestr$shortstring,"ax"
...
movzbl (%rdx),%r8d
addq $1,%rdx
movq -8(%rcx),%rcx
movq (%rcx),%rax
call *264(%rax)
...