Draft: [x86] Optimisations to ADD and SUB where references are concerned
Summary
This merge request optimises additions and subtractions whose result is later used in a reference, either by moving the ADD/SUB after the reference to break the dependency chain, or by removing the ADD/SUB instruction completely and encoding its effect into the reference's offset. For example, before:
addq $1,%rax
movzbl -1(%rsi,%rax,1),%edx
(%rax deallocated)
After:
movzbl (%rsi,%rax,1),%edx
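The equivalence of the two sequences can be sanity-checked with a small model of the address arithmetic (an illustrative sketch, not part of the patch):

```python
# Model of the address each sequence computes.
# Before: addq $1,%rax ; movzbl -1(%rsi,%rax,1),%edx
# After:  movzbl (%rsi,%rax,1),%edx
# Leaving %rax unmodified is safe because it is deallocated afterwards.

def addr_before(rsi, rax):
    rax += 1                   # addq $1,%rax
    return rsi + rax * 1 - 1   # -1(%rsi,%rax,1)

def addr_after(rsi, rax):
    return rsi + rax * 1       # (%rsi,%rax,1)

assert all(addr_before(s, a) == addr_after(s, a)
           for s in range(64) for a in range(64))
```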
Additionally, there have been cases where a value is subtracted from a register and then immediately added back, with nothing happening in between. An extension to DoSubAddOpt, along with its inclusion in OptPass1ADD, aims to iron out these inefficiencies: if the constants of the SUB/ADD pair cancel out, the pair is replaced with a CMP instruction when the flags are in use, or removed completely otherwise.
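The value-level identity behind the cancellation is straightforward; a minimal model (an illustrative sketch, not the compiler code) for an 8-bit register:

```python
# SUB/ADD pair that DoSubAddOpt now removes: subtracting x and
# immediately adding it back leaves an 8-bit register unchanged.

def sub_then_add(al, x):
    al = (al - x) & 0xFF   # subb $x,%al
    al = (al + x) & 0xFF   # addb $x,%al
    return al

# Exhaustive check over all 8-bit values and constants.
assert all(sub_then_add(al, x) == al
           for al in range(256) for x in range(256))
```

The only observable side effect of the pair is on the flags, which is why, per the description above, a CMP is substituted when the flags are still read rather than deleting the pair outright.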
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
A large number of optimisations are performed across the RTL and the compiler itself, merging additions and subtractions into references.
Relevant logs and/or screenshots
In cmsgs, the reference optimisation has a very nice cascade effect: it allows %rdi to be replaced with %rbx and all of the add instructions to be merged into one. Before:
.Lj119:
movq %rbx,%rdi
movb (%rbx),%al
movb %al,33(%rsp)
addq $1,%rdi
movb (%rdi),%al
movb %al,34(%rsp)
addq $1,%rdi
movb (%rdi),%al
movb %al,35(%rsp)
addq $1,%rdi
movb (%rdi),%al
movb %al,36(%rsp)
addq $1,%rdi
movb (%rdi),%al
movb %al,37(%rsp)
addq $1,%rdi
...
After:
.Lj119:
movq %rbx,%rdi
movb (%rbx),%al
movb %al,33(%rsp)
movb 1(%rbx),%al
movb %al,34(%rsp)
movb 2(%rbx),%al
movb %al,35(%rsp)
movb 3(%rbx),%al
movb %al,36(%rsp)
movb 4(%rbx),%al
addq $5,%rdi
movb %al,37(%rsp)
...
There is potential for further improvement, since "movq %rbx,%rdi" and "addq $5,%rdi", despite being far apart, could be merged into "leaq 5(%rbx),%rdi". The only possible performance hit is that this uses an AGU rather than an ALU, in the middle of a block of code where the AGUs are already used quite heavily and the ALUs only to move values between memory locations.
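The suggested LEA merge leaves %rdi with the same value as the MOV/ADD pair; a quick sketch of the arithmetic (hypothetical, not part of the patch):

```python
# "movq %rbx,%rdi ... addq $5,%rdi" versus "leaq 5(%rbx),%rdi":
# both write rbx + 5 (modulo 2^64) into %rdi.

MASK = (1 << 64) - 1

def mov_then_add(rbx):
    rdi = rbx                  # movq %rbx,%rdi
    return (rdi + 5) & MASK    # addq $5,%rdi

def lea(rbx):
    return (rbx + 5) & MASK    # leaq 5(%rbx),%rdi

assert all(mov_then_add(v) == lea(v)
           for v in (0, 1, 2**63, MASK - 4, MASK))
```

One additional caveat beyond the AGU/ALU trade-off: unlike ADD, LEA does not write the flags, so the merge is only valid when the flags produced by the "addq" are unused.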
The improvements to DoSubAddOpt help to clean up some operations that cancel each other out - in rgobj, before:
...
movb %dil,%al
subb $1,%al
testb %al,%al
movb %dil,%al
subb $1,%al
addb $1,%al
movb %al,%dil
.p2align 4,,10
.p2align 3
.Lj720:
...
After:
...
movb %dil,%al
subb $1,%al
testb %al,%al
.p2align 4,,10
.p2align 3
.Lj720:
...
(Some work needs to be done elsewhere, since that TEST instruction is completely useless)
Given that some adds are merged by the reference optimisation while others required the update to DoSubAddOpt, there is likely room to improve compiler performance by refactoring and consolidating this optimisation into one place.
Additional Notes
In rare cases, some inefficiencies can be introduced - for example, in math - before:
...
.Lj259:
addq $1,%rax
addsd (%rcx,%rax,8),%xmm0
cmpq %rdx,%rax
jnge .Lj259
.Lj258:
ret
After:
...
.Lj259:
movapd %xmm0,%xmm1
addsd 8(%rcx,%rax,8),%xmm1
addq $1,%rax
movapd %xmm1,%xmm0
cmpq %rdx,%rax
jnge .Lj259
.Lj258:
ret
In this case, the repositioning of the "addq" instruction causes the MovapXOpMovapX2Op optimisation to fail. This particular situation is addressed in !129 (merged), which itself makes a large number of improvements elsewhere.
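Although the rescheduled loop is less efficient, it is still semantically equivalent: the "8(%rcx,%rax,8)" reference before the increment reads the same element as "(%rcx,%rax,8)" after it. A sketch of the two loop bodies (hypothetical data, not from the patch):

```python
# Both loops accumulate data[rax+1], data[rax+2], ... while rax < rdx
# (jnge at the bottom jumps back while %rax < %rdx).

def loop_before(data, rax, rdx, acc=0.0):
    # .Lj259: addq $1,%rax ; addsd (%rcx,%rax,8),%xmm0 ; cmpq ; jnge
    while True:
        rax += 1
        acc += data[rax]
        if not (rax < rdx):
            return acc

def loop_after(data, rax, rdx, acc=0.0):
    # addsd 8(%rcx,%rax,8),%xmm1 reads element rax+1, then addq $1,%rax;
    # the extra movapd shuffles only copy the accumulator around.
    while True:
        acc += data[rax + 1]
        rax += 1
        if not (rax < rdx):
            return acc

data = [1.0, 2.0, 3.0, 4.0, 5.0]
assert loop_before(data, 0, 4) == loop_after(data, 0, 4)
```

The regression is therefore purely one of register shuffling (the two MOVAPD copies per iteration), not of the values computed.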