Draft: [x86] Optimisations to ADD and SUB where references are concerned

Summary

This merge request introduces optimisations for additions and subtractions whose result is later used in a reference, either by moving the ADD/SUB after the reference to break the dependency chain, or by removing the ADD/SUB instruction completely and folding its effect into the reference's displacement. For example, before:

	addq	$1,%rax
	movzbl	-1(%rsi,%rax,1),%edx
	(%rax deallocated)

After:

	movzbl	(%rsi,%rax,1),%edx
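
The other case mentioned above - where %rax is still in use afterwards - is handled by moving the ADD below the reference instead, with the displacement adjusted to compensate, thereby breaking the dependency chain. An illustrative sketch (not taken from actual compiler output):

	addq	$1,%rax
	movzbl	-1(%rsi,%rax,1),%edx
	(%rax still in use)

becomes:

	movzbl	(%rsi,%rax,1),%edx
	addq	$1,%rax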

Additionally, there have been cases where a value is subtracted from a register and then immediately added back, with nothing happening in between. An extension to DoSubAddOpt, along with its inclusion in OptPass1ADD, irons out these inefficiencies: if the SUB/ADD pair cancels out, the pair is removed completely, or changed to a single CMP instruction if the flags are still in use.
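
As a hedged sketch of that CMP case (the operand actually chosen by the optimiser may differ, and the label here is hypothetical): if only the zero and sign flags are consumed afterwards, then

	subq	$1,%rax
	addq	$1,%rax
	je	.Lj100	# .Lj100 is a hypothetical branch target

can be reduced to

	cmpq	$0,%rax
	je	.Lj100

since the ADD's result equals the original %rax, so a comparison against zero reproduces the ZF and SF that the ADD would have produced (CF and OF differ, which is why the live flags must be checked).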

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

A large number of optimisations are performed across the RTL and the compiler itself, merging additions and subtractions into references.

Relevant logs and/or screenshots

In cmsgs, the reference optimisation has a very nice cascade effect, since it allows %rdi to be replaced with %rbx and all of the ADD instructions to be merged - before:

.Lj119:
	movq	%rbx,%rdi
	movb	(%rbx),%al
	movb	%al,33(%rsp)
	addq	$1,%rdi
	movb	(%rdi),%al
	movb	%al,34(%rsp)
	addq	$1,%rdi
	movb	(%rdi),%al
	movb	%al,35(%rsp)
	addq	$1,%rdi
	movb	(%rdi),%al
	movb	%al,36(%rsp)
	addq	$1,%rdi
	movb	(%rdi),%al
	movb	%al,37(%rsp)
	addq	$1,%rdi
	...

After:

.Lj119:
	movq	%rbx,%rdi
	movb	(%rbx),%al
	movb	%al,33(%rsp)
	movb	1(%rbx),%al
	movb	%al,34(%rsp)
	movb	2(%rbx),%al
	movb	%al,35(%rsp)
	movb	3(%rbx),%al
	movb	%al,36(%rsp)
	movb	4(%rbx),%al
	addq	$5,%rdi
	movb	%al,37(%rsp)
	...

There is potential for further improvement: "movq %rbx,%rdi" and "addq $5,%rdi", despite being far apart, could be merged into "leaq 5(%rbx),%rdi". The only possible performance hit is that this uses an AGU rather than an ALU in the middle of a block of code where the AGUs are already used quite heavily and the ALUs only move values between memory locations.
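
A minimal sketch of that merge, assuming %rdi is not read, %rbx is not written, and the flags set by the ADD are not consumed in between:

	movq	%rbx,%rdi
	...
	addq	$5,%rdi

would become:

	leaq	5(%rbx),%rdi

(LEA computes the address on an AGU and, unlike ADD, leaves the flags untouched.)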

The improvements to DoSubAddOpt help to clean up some operations that cancel each other out - in rgobj, before:

	...
	movb	%dil,%al
	subb	$1,%al
	testb	%al,%al
	movb	%dil,%al
	subb	$1,%al
	addb	$1,%al
	movb	%al,%dil
	.p2align 4,,10
	.p2align 3
.Lj720:
	...

After:

	...
	movb	%dil,%al
	subb	$1,%al
	testb	%al,%al
	.p2align 4,,10
	.p2align 3
.Lj720:
	...

(Some work needs to be done elsewhere, since that TEST instruction is completely redundant: the preceding SUB already sets the ZF and SF that the TEST recomputes, although the two differ on CF and OF.)
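
A sketch of the ideal sequence, assuming only ZF and SF are read by the code that follows:

	...
	movb	%dil,%al
	subb	$1,%al
	.p2align 4,,10
	.p2align 3
.Lj720:
	...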

Given that some ADDs are merged directly and others required the update to DoSubAddOpt, there is likely room to improve compiler performance by refactoring and consolidating this optimisation into one place.

Additional Notes

In rare cases, some inefficiencies can occur - for example, in math - before:

	...
.Lj259:
	addq	$1,%rax
	addsd	(%rcx,%rax,8),%xmm0
	cmpq	%rdx,%rax
	jnge	.Lj259
.Lj258:
	ret

After:

	...
.Lj259:
	movapd	%xmm0,%xmm1
	addsd	8(%rcx,%rax,8),%xmm1
	addq	$1,%rax
	movapd	%xmm1,%xmm0
	cmpq	%rdx,%rax
	jnge	.Lj259
.Lj258:
	ret

In this case, the repositioning of the "addq" instruction causes the MovapXOpMovapX2Op optimisation to fail. This particular situation is addressed in !129 (merged), which itself makes a large number of improvements elsewhere.
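
For reference, MovapXOpMovapX2Op collapses a "copy, operate on the copy, copy back" triplet into a single operation on the original register, so once it can fire again the loop body should presumably reduce to something like:

	...
.Lj259:
	addsd	8(%rcx,%rax,8),%xmm0
	addq	$1,%rax
	cmpq	%rdx,%rax
	jnge	.Lj259
.Lj258:
	ret

(This is a sketch of the expected result, not the actual output of !129.)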
