[x86] LEA expansion optimisation
Summary
This merge request expands 3-component LEA instructions (i.e. those that contain an offset, a base register/symbol and an index register) into a LEA and an ADD instruction (the latter adds the offset) if the target register is used within the next 1 or 2 instructions (unless that instruction is a call or jump). This is because LEA instructions that use all three components of a reference have a latency of 3 cycles before the destination can be read, whereas 2-component versions only have 1 cycle of latency.
This optimisation is not performed when optimising for size.
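The rule above can be illustrated with a small Python sketch. This is only a model of the decision as worded in this summary, not the compiler's actual peephole code; the `Lea` class, `expand_lea` and all parameter names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Lea:
    """Hypothetical model of `leaq offset(base,index,scale),dest` (AT&T syntax)."""
    offset: int
    base: Optional[str]
    index: Optional[str]
    scale: int
    dest: str

    def text(self) -> str:
        inner = self.base or ""
        if self.index:
            inner += "," + self.index
            if self.scale != 1:
                inner += "," + str(self.scale)
        disp = str(self.offset) if self.offset else ""
        return f"leaq {disp}({inner}),{self.dest}"


def expand_lea(lea: Lea,
               next_use_distance: Optional[int],
               next_use_is_call_or_jump: bool = False,
               optimise_for_size: bool = False) -> List[str]:
    """Return the instruction(s) that replace `lea` under the summary's rule.

    `next_use_distance` is how many instructions later the destination is
    first used (None if no nearby use is known).
    """
    three_component = (lea.offset != 0
                       and lea.base is not None
                       and lea.index is not None)
    worthwhile = (three_component
                  and not optimise_for_size
                  and next_use_distance is not None
                  and 1 <= next_use_distance <= 2
                  and not next_use_is_call_or_jump)
    if not worthwhile:
        return [lea.text()]
    # Drop the offset from the LEA and add it back with a separate ADD.
    return [replace(lea, offset=0).text(), f"addq ${lea.offset},{lea.dest}"]


if __name__ == "__main__":
    lea = Lea(offset=245, base="%rcx", index="%r10", scale=1, dest="%r10")
    for line in expand_lea(lea, next_use_distance=1):
        print(line)
```

Run on the `leaq 245(%rcx,%r10),%r10` case from the system unit, the sketch yields a 2-component LEA followed by `addq $245,%r10`; any further rewriting of the remaining LEA is a separate step and is not modelled here.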
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
At a small cost of code size, tight loops should now be very slightly faster.
Relevant logs and/or screenshots
In the system unit - before:
.section .text.n_system_$$_utf8codepointlen$pansichar$int64$boolean$$int64,"ax"
...
leaq 245(%rcx,%r10),%r10
movzbl (%r10,%rsi,1),%esi
movl %r11d,%r10d
...
After - the remaining leaq (%rcx,%r10),%r10 gets converted into addq %rcx,%r10 (the two ADD instructions are swapped in the post-peephole stage to help minimise a pipeline stall if %rcx is not ready, although in this case it is almost certainly ready):
.section .text.n_system_$$_utf8codepointlen$pansichar$int64$boolean$$int64,"ax"
...
addq $245,%r10
addq %rcx,%r10
movzbl (%r10,%rsi,1),%esi
movl %r11d,%r10d
...
In this case, addq $245,%r10 could be merged into movzbl (%r10,%rsi,1),%esi to create movzbl 245(%r10,%rsi,1),%esi because %r10 gets overwritten afterwards (although this would cost performance if %esi is used by the immediately following instruction). This will be something to optimise in a future patch.
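As a rough illustration of that possible future merge, the following Python sketch folds an immediate ADD into the displacement of a following AT&T memory operand. It is not part of this patch; `fold_add_into_operand` and its parameters are hypothetical, and the sketch assumes the caller has already established that the register is overwritten afterwards. It also does not check that the resulting displacement still fits the instruction encoding.

```python
from typing import Optional


def fold_add_into_operand(imm: int,
                          reg: str,
                          operand: str,
                          reg_overwritten_after: bool) -> Optional[str]:
    """Fold `addq $imm,reg` into a memory operand such as "(%r10,%rsi,1)".

    Returns the new operand text, or None if the fold is not applied.
    """
    disp_text, _, rest = operand.partition("(")
    parts = rest.rstrip(")").split(",")
    base = parts[0]
    index = parts[1] if len(parts) > 1 else ""
    # Only fold when reg is the base (an index would be scaled) and the
    # ADD's result is not needed after the memory access.
    if not reg_overwritten_after or base != reg or index == reg:
        return None
    disp = int(disp_text or "0") + imm
    return f"{disp or ''}({rest}"


if __name__ == "__main__":
    print(fold_add_into_operand(245, "%r10", "(%r10,%rsi,1)", True))
```

For the listing above this produces the operand `245(%r10,%rsi,1)`, matching the merged movzbl described in the text.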
In classes, the expansion gets coupled with another optimisation - before:
.section .text.n_classes$_$tstrings_$__$$_setstrings$array_of_ansistring,"ax"
...
movq -16(%rbp),%rax
leaq 8(,%rax,8),%rdi
movq %rdi,%rcx
call fpc_getmem
movq %rax,%r12
movq %rdi,%r8
...
After - addq $8,%rdi and movq %rdi,%rcx get changed to leaq 8(%rdi),%rcx and addq $8,%rdi to minimise a pipeline stall while removing the 3-cycle latency between the original leaq 8(,%rax,8),%rdi and movq %rdi,%rcx:
.section .text.n_classes$_$tstrings_$__$$_setstrings$array_of_ansistring,"ax"
...
movq -16(%rbp),%rax
leaq (,%rax,8),%rdi
leaq 8(%rdi),%rcx
addq $8,%rdi
call fpc_getmem
movq %rax,%r12
movq %rdi,%r8
...
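The latency argument for both listings can be checked with a rough dependency-chain estimate. The Python sketch below is a simplified model, not a measurement: it uses the 3-cycle and 1-cycle LEA latencies stated in the summary, assumes 1 cycle for ADD and MOV, treats input registers as ready at cycle 0, and ignores issue width, the preceding load and any move elimination. The `ready_times` helper is hypothetical.

```python
from typing import Dict, List, Tuple

# (destination register, source registers, assumed latency in cycles)
Instr = Tuple[str, List[str], int]


def ready_times(instrs: List[Instr]) -> Dict[str, int]:
    """Cycle at which each written register becomes readable."""
    ready: Dict[str, int] = {}
    for dest, sources, latency in instrs:
        start = max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = start + latency
    return ready


# system unit: leaq 245(%rcx,%r10),%r10 versus the two ADDs
system_before = [("r10", ["rcx", "r10"], 3)]
system_after = [("r10", ["r10"], 1),          # addq $245,%r10
                ("r10", ["rcx", "r10"], 1)]   # addq %rcx,%r10

# classes: time until %rcx (the argument for fpc_getmem) is ready
classes_before = [("rdi", ["rax"], 3),        # leaq 8(,%rax,8),%rdi
                  ("rcx", ["rdi"], 1)]        # movq %rdi,%rcx
classes_after = [("rdi", ["rax"], 1),         # leaq (,%rax,8),%rdi
                 ("rcx", ["rdi"], 1),         # leaq 8(%rdi),%rcx
                 ("rdi", ["rdi"], 1)]         # addq $8,%rdi

if __name__ == "__main__":
    print(ready_times(system_before)["r10"], ready_times(system_after)["r10"])
    print(ready_times(classes_before)["rcx"], ready_times(classes_after)["rcx"])
```

Under these assumptions %r10 is readable after 2 cycles instead of 3 in the system unit example, and %rcx after 2 cycles instead of 4 in the classes example, which is consistent with the "very slightly faster" expectation stated above.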