
[x86] LEA expansion optimisation

Summary

This merge request expands 3-component LEA instructions (i.e. those that contain an offset, a base register/symbol and an index register) into a LEA and an ADD instruction (the latter adds the offset) if the target register is used within the next 1 or 2 instructions (unless the instruction is a call or jump). This is because LEA instructions that use all three components of a reference have a latency of 3 cycles before the destination can be read, whereas 2-component versions have only 1 cycle of latency.

This optimisation is not performed when optimising for size.
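The rule described above can be sketched roughly as follows. This is an illustrative Python model, not the compiler's actual code: the `Instr` type and the `should_expand` and `expand` helpers are hypothetical names, the two-instruction window comes from the summary, and the assumption that a call or jump inside the window blocks the expansion is only one reading of that summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instr:
    """Hypothetical instruction record: a mnemonic and the registers it reads."""
    mnemonic: str
    reads: Tuple[str, ...] = ()


def is_call_or_jump(instr: Instr) -> bool:
    # Simplification: "call" and any mnemonic starting with "j" count as control flow.
    return instr.mnemonic == "call" or instr.mnemonic.startswith("j")


def should_expand(dest: str, following: List[Instr],
                  optimise_for_size: bool = False, window: int = 2) -> bool:
    """Decide whether a 3-component LEA writing `dest` is worth splitting."""
    if optimise_for_size:
        return False  # the split costs an extra instruction
    for instr in following[:window]:
        if is_call_or_jump(instr):
            return False  # assumed reading of "unless the instruction is a call or jump"
        if dest in instr.reads:
            return True   # dest is needed soon, so the 3-cycle latency would hurt
    return False


def expand(offset: int, base: str, index: str, dest: str) -> List[str]:
    """Split `leaq offset(base,index),dest` into a 2-component LEA plus an ADD."""
    return [f"leaq ({base},{index}),{dest}", f"addq ${offset},{dest}"]
```

Applied to the system unit listing below, `should_expand("%r10", ...)` would see `%r10` read by the very next `movzbl`, and `expand(245, "%rcx", "%r10", "%r10")` yields the `leaq (%rcx,%r10),%r10` and `addq $245,%r10` pair that later peephole passes refine further.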

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

At a small cost of code size, tight loops should now be very slightly faster.

Relevant logs and/or screenshots

In the system unit - before:

.section .text.n_system_$$_utf8codepointlen$pansichar$int64$boolean$$int64,"ax"
	...
	leaq	245(%rcx,%r10),%r10
	movzbl	(%r10,%rsi,1),%esi
	movl	%r11d,%r10d
	...

After - the remaining leaq (%rcx,%r10),%r10 gets converted into addq %rcx,%r10 (the two ADD instructions are swapped in the post-peephole stage to help minimise a pipeline stall if %rcx is not ready, although in this case it is almost certainly ready):

.section .text.n_system_$$_utf8codepointlen$pansichar$int64$boolean$$int64,"ax"
	...
	addq	$245,%r10
	addq	%rcx,%r10
	movzbl	(%r10,%rsi,1),%esi
	movl	%r11d,%r10d
	...

In this case, addq $245,%r10 could be merged into movzbl (%r10,%rsi,1),%esi to create movzbl 245(%r10,%rsi,1),%esi because %r10 gets overwritten afterwards (although this would be a loss of performance if %esi is used by the immediately following instruction). This is something to optimise in a future patch.


In classes, the expansion gets coupled with another optimisation - before:

.section .text.n_classes$_$tstrings_$__$$_setstrings$array_of_ansistring,"ax"
	...
	movq	-16(%rbp),%rax
	leaq	8(,%rax,8),%rdi
	movq	%rdi,%rcx
	call	fpc_getmem
	movq	%rax,%r12
	movq	%rdi,%r8
	...

After - addq $8,%rdi and movq %rdi,%rcx get changed to leaq 8(%rdi),%rcx and addq $8,%rdi to minimise a pipeline stall while removing the 3-cycle latency between the original leaq 8(,%rax,8),%rdi and movq %rdi,%rcx:

.section .text.n_classes$_$tstrings_$__$$_setstrings$array_of_ansistring,"ax"
	...
	movq	-16(%rbp),%rax
	leaq	(,%rax,8),%rdi
	leaq	8(%rdi),%rcx
	addq	$8,%rdi
	call	fpc_getmem
	movq	%rax,%r12
	movq	%rdi,%r8
	...
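As a sanity check on the classes example, the two sequences can be modelled with a tiny register simulation to confirm that %rdi and %rcx end up with the same values before the call. This is only a sketch: the helper names (`lea`, `add_imm`, `mov`, `before`, `after`) are hypothetical, arithmetic is taken modulo 2^64, and CPU flags are not modelled (ADD writes them, LEA does not).

```python
MASK64 = (1 << 64) - 1  # registers wrap modulo 2**64


def lea(regs, dest, disp=0, base=None, index=None, scale=1):
    """Model leaq disp(base,index,scale),dest."""
    addr = disp
    if base is not None:
        addr += regs[base]
    if index is not None:
        addr += regs[index] * scale
    regs[dest] = addr & MASK64


def add_imm(regs, dest, imm):
    """Model addq $imm,dest (flags ignored)."""
    regs[dest] = (regs[dest] + imm) & MASK64


def mov(regs, dest, src):
    """Model movq src,dest."""
    regs[dest] = regs[src]


def before(rax):
    regs = {"rax": rax, "rdi": 0, "rcx": 0}
    lea(regs, "rdi", disp=8, index="rax", scale=8)  # leaq 8(,%rax,8),%rdi
    mov(regs, "rcx", "rdi")                         # movq %rdi,%rcx
    return regs


def after(rax):
    regs = {"rax": rax, "rdi": 0, "rcx": 0}
    lea(regs, "rdi", index="rax", scale=8)          # leaq (,%rax,8),%rdi
    lea(regs, "rcx", disp=8, base="rdi")            # leaq 8(%rdi),%rcx
    add_imm(regs, "rdi", 8)                         # addq $8,%rdi
    return regs
```

In the rewritten sequence, leaq 8(%rdi),%rcx and addq $8,%rdi both depend only on the first LEA, so neither has to wait for the other.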
