
[x86] Advanced register simplification optimisation

Summary

This merge request seeks to remove the temporary storage registers that are frequently inserted by the code generator (it inserts them because, at that stage, it does not yet know whether the sources and destinations will be stored in registers, on the stack or elsewhere in memory). The optimisation looks for constructs such as the following:

	movl	%reg1,%reg2
	(Instructions that transmute %reg2)
	movl	%reg2,%reg1
	(%reg2 deallocated)

and changes them to:

	(Instructions that transmute %reg1)

This goes further than a simple peephole optimisation in that it actually deallocates %reg2 over this block so that it can be reused.

This is not limited to mov %reg1,%reg2 ... mov %reg2,%reg1. In many situations, mov (ref),%reg2 ... mov %reg2,%reg1 and mov %reg3,%reg2 ... mov %reg2,%reg1 can be simplified in the same way (although more care is taken in these cases).
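The core idea can be sketched in a few lines of Python. This is a minimal illustration only: the routine name and the (opcode, [operands]) representation are invented here and bear no relation to the actual compiler code, which must also track flags, memory clobbers and register allocation markers.

```python
# Hypothetical sketch of the simplification. Operands follow AT&T order:
# the last operand is the destination.

def simplify_temp_register(instrs):
    out = list(instrs)
    i = 0
    while i < len(out):
        op, ops = out[i]
        if op == "mov":
            src, tmp = ops
            for j in range(i + 1, len(out)):
                op2, ops2 = out[j]
                if (op2, ops2) == ("mov", [tmp, src]):
                    # Rewrite the intervening block to use src directly,
                    # then drop both MOVs; tmp is deallocated over the block.
                    for k in range(i + 1, j):
                        opk, opsk = out[k]
                        out[k] = (opk, [src if o == tmp else o for o in opsk])
                    del out[j]
                    del out[i]
                    break
                if src in ops2:
                    break  # src is touched in between: not safe to rewrite
        i += 1
    return out
```

Feeding it a mov %reg1,%reg2 ... mov %reg2,%reg1 sequence removes both MOVs and rewrites the intervening instructions to use %reg1 directly.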

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Instructions are removed as the peephole optimizer attempts to eliminate temporary storage registers.

Additional notes

Some fixes to RegInInstruction and RegModifiedByInstruction were required in order to handle (I)MUL and (I)DIV instructions more accurately.

ReplaceRegisterInInstruction was updated with a new parameter to permit changes to write and modification operands and not just read operands (this is what the simplification algorithm used).

This optimisation is only performed under -O3 and currently incurs a large compile-time penalty. This is because of the frequent looking ahead and backwards required to confirm the state of registers, both to find further optimisations and to ensure that the registers being simplified won't be affected by partial register reads and writes. This may be deemed unacceptable, but if and when !74 is approved in some form, I plan to use compiler hints to indicate the used size of registers (since a lot of peephole optimisations allocate the full register size, and allocation markers may not be nearby). Otherwise, I aim to refactor and optimise this pass to make it faster later on.
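To illustrate where the cost comes from, here is a hypothetical sketch of the forward scan (the names safe_to_replace, overlaps and the SUBREGS table are invented for illustration, not actual compiler routines): every candidate requires a linear walk over the following instructions, and the check must treat sub- and super-registers as aliases so that partial reads and writes are not missed.

```python
# Illustrative alias table; a real table has one family per
# architectural register, including %r8w/%r8b-style names.
SUBREGS = {
    "%rax": {"%rax", "%eax", "%ax", "%al", "%ah"},
    "%rcx": {"%rcx", "%ecx", "%cx", "%cl", "%ch"},
}

def overlaps(a, b):
    """True if the two operands name (parts of) the same register."""
    for family in SUBREGS.values():
        if a in family and b in family:
            return True
    return a == b

def safe_to_replace(instrs, start, tmp):
    """Scan forward from `start`: replacing tmp is only safe while no
    aliasing register other than tmp itself is referenced. (The real
    pass stops at the deallocation marker rather than scanning to the
    end of the instruction list.)"""
    for op, ops in instrs[start:]:
        for o in ops:
            if o != tmp and overlaps(o, tmp):
                return False  # partial register access: give up
    return True
```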

Relevant logs and/or screenshots

A lot of the time, the end result is simply MOV instructions that have been rearranged, but a very large number of files receive improvements of some form.

(All examples are taken under x86_64-win64, -O4)

A common improvement that appears across a lot of object-oriented units (advancedipc is used as an example here) - before:

.globl	ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT
ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT:
	...
	movq	-24(%rbp),%rax
	movq	%rax,%rdx
	movq	%rax,%rcx
	call	*104(%rdx)
	...

After:

.globl	ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT
ADVANCEDIPC$_$TIPCCLIENT_$__$$_CREATE$TCOMPONENT$$TIPCCLIENT:
	...
	movq	-24(%rbp),%rcx
	movq	%rcx,%rdx
	call	*104(%rcx)
	...

This construct appears very often.

The crc unit showcases mov %reg1,%reg2 ... mov %reg2,%reg1 quite nicely - before:

.Lj12:
	...
	movl	%ecx,%r10d
	shrl	$8,%r10d
	leaq	TC_$CRC_$$_CRC32_TABLE(%rip),%r11
	xorl	(%r11,%r9,4),%r10d
	movl	%r10d,%ecx
	...

After:

.Lj12:
	...
	shrl	$8,%ecx
	leaq	TC_$CRC_$$_CRC32_TABLE(%rip),%r11
	xorl	(%r11,%r9,4),%ecx
	...

An example in constexp that showcases mov (ref),%reg2 ... mov %reg2,%reg1 - before:

.globl	CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT
CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT:
	...
	movq	8(%rdx),%rax
	negq	%rax
	movq	$9223372036854775807,%r11
	addq	%r11,%rax
	movq	%rax,%r9

After:

.globl	CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT
CONSTEXP_$$_ADD_TO$TCONSTEXPRINT$QWORD$$TCONSTEXPRINT:
	...
	movq	8(%rdx),%r9
	negq	%r9
	movq	$9223372036854775807,%r11
	addq	%r11,%r9

In aasmcpu, the register simplification permits a later optimisation to create an ADD instruction instead of a larger LEA instruction - before:

.Lj2182:
	...
	movq	U_$AASMCPU_$$_INSTABMEMREFSIZEINFOCACHE(%rip),%rax
	movzwl	%r14w,%edx
	shlq	$5,%rdx
	leaq	(%rax,%rdx),%rcx
	...

After:

.Lj2182:
	...
	movq	U_$AASMCPU_$$_INSTABMEMREFSIZEINFOCACHE(%rip),%rcx
	movzwl	%r14w,%edx
	shlq	$5,%rdx
	addq	%rdx,%rcx
	...
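The equivalence is easy to model: once the base address is loaded straight into %rcx, leaq (%rax,%rdx),%rcx (which computes %rax + %rdx) collapses into addq %rdx,%rcx. A quick sanity check in Python (illustrative only):

```python
MASK64 = (1 << 64) - 1  # model 64-bit register wrap-around

def with_lea(base, idx):
    rax, rdx = base, idx
    rcx = (rax + rdx) & MASK64   # leaq (%rax,%rdx),%rcx
    return rcx

def with_add(base, idx):
    rcx, rdx = base, idx         # base now loaded directly into %rcx
    rcx = (rcx + rdx) & MASK64   # addq %rdx,%rcx
    return rcx

assert with_lea(0x1000, 3 << 5) == with_add(0x1000, 3 << 5)
```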

Also in aasmcpu, three similar quartets get simplified - before:

	...
.Lj2276:
	movq	-648(%rbp),%rax
	orq	-616(%rbp),%rax
	movq	%rax,%rcx
	cmpq	$536870912,%rax ; <- This was changed from %rcx to %rax by DeepMOVOpt to minimise a pipeline stall
	...

After:

	...
.Lj2276:
	movq	-648(%rbp),%rcx
	orq	-616(%rbp),%rcx
	cmpq	$536870912,%rcx ; <- %rax has been removed completely from this region
	...

In aasmcpu again, one optimisation looks suspect at first, but on closer analysis the final values of %eax and %r15d turn out to be the same - before:

	...
.Lj478:
	movl	112(%rbp),%eax
	subl	%r14d,%eax
	leal	2(%eax),%r15d
	addl	$2,%eax
	...

After:

	...
.Lj478:
	movl	112(%rbp),%r15d
	subl	%r14d,%r15d
	leal	2(%r15d),%eax
	addl	$2,%r15d
	...
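The equivalence can be checked by modelling both sequences (32-bit wrap-around included): in each version one register receives (mem - %r14d) + 2 via the LEA and the other via the ADD, so only the roles of %eax and %r15d are swapped.

```python
MASK32 = (1 << 32) - 1  # model 32-bit register wrap-around

def before(mem, r14d):
    eax = (mem - r14d) & MASK32   # movl 112(%rbp),%eax / subl %r14d,%eax
    r15d = (eax + 2) & MASK32     # leal 2(%eax),%r15d
    eax = (eax + 2) & MASK32      # addl $2,%eax
    return eax, r15d

def after(mem, r14d):
    r15d = (mem - r14d) & MASK32  # movl 112(%rbp),%r15d / subl %r14d,%r15d
    eax = (r15d + 2) & MASK32     # leal 2(%r15d),%eax
    r15d = (r15d + 2) & MASK32    # addl $2,%r15d
    return eax, r15d

assert before(100, 7) == after(100, 7)
```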

The system unit has a couple of cute ones - before:

.Lj256:
	...
	movq	%r9,%r11
	subq	%rcx,%r11
	movq	%r11,%rdx
	sarq	$63,%rdx
	andq	$7,%rdx
	addq	%rdx,%r11
	sarq	$3,%r11
	movq	%r11,%rax
	...

After:

.Lj256:
	...
	movq	%r9,%rax
	subq	%rcx,%rax
	cqto
	andq	$7,%rdx
	addq	%rdx,%rax
	sarq	$3,%rax
	...
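The CQTO substitution in the middle of this block is sound because, once the value lives in %rax, the original movq %r11,%rdx / sarq $63,%rdx pair fills %rdx with copies of the sign bit, which is exactly what CQTO (sign-extend %rax into %rdx) does. A small model:

```python
MASK64 = (1 << 64) - 1

def mov_sar63(value):
    """movq %reg,%rdx / sarq $63,%rdx: the arithmetic shift replicates
    the sign bit across all 64 bits of %rdx."""
    return MASK64 if (value >> 63) & 1 else 0

def cqto(rax):
    """cqto: sign-extend %rax into %rdx (all-zeros or all-ones)."""
    return MASK64 if (rax >> 63) & 1 else 0

assert mov_sar63(1 << 63) == cqto(1 << 63) == MASK64
assert mov_sar63(5) == cqto(5) == 0
```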

The removal of a TEST instruction (thanks to coupling with other peephole optimizations that permit the SUB instruction to take on its role) - before:

.Lj273:
	movq	(%rcx),%r11
	subq	(%rdx),%r11
	movq	%r11,%r10
	testq	%r11,%r11
	je	.Lj277
	...

After:

.Lj273:
	movq	(%rcx),%r10
	subq	(%rdx),%r10
	je	.Lj277
	...
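The reason the TEST can be dropped: SUB already sets ZF according to its own result, so testq %r11,%r11 immediately afterwards merely recomputes the same flag. A trivial model:

```python
MASK64 = (1 << 64) - 1

def zf_after_sub(a, b):
    """ZF after subq b,a: set iff the 64-bit result is zero."""
    return ((a - b) & MASK64) == 0

def zf_after_test(r):
    """ZF after testq r,r: set iff r is zero."""
    return (r & MASK64) == 0

# Testing the SUB result always reproduces the flag SUB already set:
for a, b in [(7, 7), (8, 7), (0, 5)]:
    assert zf_after_sub(a, b) == zf_after_test((a - b) & MASK64)
```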

And again, this time with a SHR instruction (this one is clearer with DEBUG_AOPTCPU since MOV instructions get removed and DeepMOVOpt changes registers) - before:

.Lj2064:
	...
# Peephole Optimization: Mov2Nop 3b done
	movq	$-3689348814741910323,%rdx
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
	mulx	%rcx,%rax,%rax
	shrq	$3,%rax
	movq	%rax,%rcx
# Peephole Optimization: %rcx = %rax; changed to minimise pipeline stall (MovXXX2MovXXX)
	testq	%rax,%rax
	jne	.Lj2064
	...

After:

.Lj2064:
	...
# Peephole Optimization: Simplification start, replacing %rax with %rcx
# Peephole Optimization: Mov2Nop 1a done
	movq	$-3689348814741910323,%rdx
# Peephole Optimization: Simplified register usage, replacing %rax with %rcx
	mulx	%rcx,%rcx,%rcx
# Peephole Optimization: Simplified register usage, replacing %rax with %rcx
	shrq	$3,%rcx
# Peephole Optimization: Simplification endpoint, replacing %rax with %rcx (MovMov2Mov 4)
	jne	.Lj2064
	...

The adler unit under -CpCOREAVX2 shows improvement (and this was the original example that I sought to optimise) - before:

.Lj14:
	...
	movl	%r10d,%ebx
	movq	$281539415969072,%rdx
	mulx	%rbx,%rdx,%rdx
	imull	$65521,%edx
	subl	%edx,%ebx
	movl	%ebx,%r10d
	movl	%ecx,%ebx
	movq	$281539415969072,%rdx
	mulx	%rbx,%rdx,%rdx
	imull	$65521,%edx
	subl	%edx,%ebx
	movl	%ebx,%ecx
	...

After:

.Lj14:
	...
	andl	%r10d,%r10d
	movq	$281539415969072,%rdx
	mulx	%r10,%rdx,%rdx
	imull	$65521,%edx
	subl	%edx,%r10d
	andl	%ecx,%ecx
	movq	$281539415969072,%rdx
	mulx	%rcx,%rdx,%rdx
	imull	$65521,%edx
	subl	%edx,%ecx
	...

%ebx is removed entirely. Unfortunately, the compiler cannot determine whether the upper halves of %r10 and %rcx are zero without additional hints, so the AND instructions are necessary to ensure that stale upper bits don't affect the MULX instructions.
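For reference, the ANDs work because in 64-bit mode any instruction with a 32-bit register destination zero-extends its result into the full 64-bit register: andl %r10d,%r10d leaves the low half unchanged (x AND x = x) while clearing bits 63..32. A quick model (illustrative Python, not compiler code):

```python
MASK32 = (1 << 32) - 1

def write32(value32):
    """Model of any 32-bit register write in 64-bit mode: the result is
    zero-extended into the full 64-bit register."""
    return value32 & MASK32

def andl_self(r64):
    """andl %reg32,%reg32: low 32 bits unchanged, but the write clears
    bits 63..32 via the zero-extension rule."""
    low = r64 & MASK32
    return write32(low & low)

assert andl_self(0xDEADBEEF12345678) == 0x12345678
```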

Future development

Some additional peephole optimisations may be required to clean up inefficiencies that occasionally appear, such as adding SAR, SHR and SHL to the instructions accepted by the AndOp2Op optimisation in order to help remove and %reg,%reg.

In the Adler example, because %ebx has now been deallocated, a peephole optimisation could attempt to change the write destination from %rdx to %rbx (along with its eventual merge into %r10d and %rcx), permitting the second movq $281539415969072,%rdx to be removed.

This style of optimisation is very fundamental and could easily be ported in some form to AArch64, among other platforms.

Edited by J. Gareth "Kit" Moreton
