x86: LEA optimisation improvements (part 1)

Summary

This merge request improves the optimisation of LEA instructions on x86 targets through a number of additions (an illustrative sketch follows the list):

  • In Pass 1, if an ADD/SUB follows another ADD/SUB that modifies the same register (as long as the flags aren't in use) and exactly one of them has a constant operand, the one with the constant is moved last, as this tends to enable other optimisations.
  • Adjacent ADD/SUB instructions with constant operands that modify the same register are now merged in Pass 1.
  • In the post-peephole stage, if an ADD/SUB follows another ADD/SUB that modifies the same register (as long as the flags aren't in use) and one has a constant, the one with the constant is moved first to minimise pipeline stalls (the constant operation can execute while the other source register is still being loaded). This also undoes the Pass 1 reordering where it didn't help other optimisations.
  • New AddLea2Lea and SubLea2Lea optimisations in Pass 1 that merge "ADD/SUB const,%reg" into an adjacent LEA instruction by folding the constant into the LEA instruction's offset.
  • If AddMov2LeaAdd or AddMov2LeaSub is performed in Pass 2, OptPass1ADD or OptPass1SUB (whichever is applicable) is then run on the arithmetic instruction, since moving it after the MOV often allows it to be merged with a later instruction.
  • AddMov2Lea and SubMov2Lea can now have instructions between the arithmetic instruction and the MOV instruction (-O3 only).
  • New Lea2Shl optimisation that changes constructs like "leal (,%edx,4),%edx" into "shll $2,%edx".
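
A rough before/after illustration of the Pass 1 reordering/merging, AddLea2Lea and Lea2Shl rules, using hand-written fragments rather than actual compiler output (the registers and constants here are hypothetical):

	; Pass 1: the constant ADD is moved last, then merged with the other constant ADD
	addl	$4,%eax
	addl	%ebx,%eax
	addl	$8,%eax
	; ...becomes:
	addl	%ebx,%eax
	addl	$12,%eax

	; AddLea2Lea: the constant is folded into the adjacent LEA's offset
	; (%eax + 4) + %ebx + 8 = %eax + %ebx + 12
	addl	$4,%eax
	leal	8(%eax,%ebx),%eax
	; ...becomes:
	leal	12(%eax,%ebx),%eax

	; Lea2Shl: a pure power-of-two scale becomes a shift
	leal	(,%edx,4),%edx
	; ...becomes:
	shll	$2,%edx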

System

  • Processor architecture: i386, x86-64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

No test regressions should occur, and many small improvements should appear around instruction sequences that add or subtract immediates and registers into a single destination register. Both speed and code size should improve.

Relevant logs and/or screenshots

A large number of source files show improvements. Here are some cherry-picked examples:

In aasmcpu.s - before:

        ...
	shll	$6,%ecx
	leal	(,%edx,8),%edx
	orl	%edx,%ecx
	orl	%eax,%ecx
	movq	56(%rsp),%rax
	movb	%cl,4(%rax)
.Lj1033:
	movq	56(%rsp),%rdx
	movzbl	(%rdx),%eax
	addq	$1,%rax
	movq	56(%rsp),%rcx
	movzbl	1(%rcx),%edx
	addq	%rax,%rdx ; (gets converted from "leaq (%rax,%rdx),%rdx")
	movq	56(%rsp),%rax
        ...

After:

        ...
	shll	$6,%ecx
	shll	$3,%edx ; converted to shift instruction
	orl	%edx,%ecx
	orl	%eax,%ecx
	movq	56(%rsp),%rax
	movb	%cl,4(%rax)
.Lj1033:
	movq	56(%rsp),%rdx
	movzbl	(%rdx),%eax
	movq	56(%rsp),%rcx
	movzbl	1(%rcx),%edx
	leaq	1(%rax,%rdx),%rdx ; (due to AddLea2Lea)
	movq	56(%rsp),%rax
        ...

In aasmtai.s - before:

        ...
	movsbl	%al,%eax
	movsbl	%bl,%edx
	subl	%edx,%eax
	subl	$1,%eax
	movb	%al,%bl
.Lj1194:
        ...

After:

        ...
	movsbl	%al,%eax
	movsbl	%bl,%edx
	subl	$1,%eax ; This helps minimise the pipeline stall: %eax can be modified while %edx is still being loaded (assuming %eax and %edx couldn't be loaded simultaneously)
	subl	%edx,%eax
	movb	%al,%bl
.Lj1194:
        ...

In classes.s - before:

        ...
.Lj3920:
        ...
	addq	$1,%rdx
	leaq	-1(%rax,%rdx,1),%rdx
        ...

After:

        ...
.Lj3920:
        ...
	addq	%rax,%rdx ; The +1 from the ADD cancels the LEA's -1 offset, and with the offset now zero, the LEA can be converted into an ADD instruction.
        ...

Future work

  • The Pass 2 LEA optimisations would likely get more work done if they were in Pass 1, but this requires a lot more experimentation and tinkering to ensure good optimisation and will be explored at a later date.
  • In the aasmcpu.s example, "movq 56(%rsp),%rdx" is closely followed by "movq 56(%rsp),%rcx"; in theory, since %rdx is deallocated as soon as it is dereferenced, it could be replaced with %rcx so that this part of the stack is only read once (see the sketch below).
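
A minimal sketch of what this hypothetical follow-up optimisation might produce (not implemented in this merge request):

	; current output:
	movq	56(%rsp),%rdx
	movzbl	(%rdx),%eax
	movq	56(%rsp),%rcx
	movzbl	1(%rcx),%edx
	; possible output, with %rdx replaced by %rcx and the stack slot read once:
	movq	56(%rsp),%rcx
	movzbl	(%rcx),%eax
	movzbl	1(%rcx),%edx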