[x86] Deterministic MOV optimisations (!596) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:cmp-jne-mov into main Feb 26, 2024

Summary

This merge request contains a number of new peephole optimisations where it can identify that a particular register value is a specific determined value and replace MOV %reg1,%reg instructions with MOV x,%reg2, and further do the same with arithmetic and logical instructions.

The first commit is the original CMP/JNE/MOV optimisation that unlocked a heap of deterministic potential, and the subsequent commits capitalise on this. Further improvements can be made, but this patch has gotten bloated enough as it is.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Speed gains are achieved by identifying registers with deterministic values across certain comparisons.

Additional notes

These optimisations tend to increase code size. Adjustments for -Os will be added at a later date.

Relevant logs and/or screenshots

A huge number of units gain an improvement. All log examples are under x86_64-win64, -O4.

In the classes unit, a reference rather than a register is deterministic - before:

.section .text.n_classes$_$twriter_$__$$_writeproperty$tpersistent$pointer,"ax"
	...
.Lj10217:
	cmpq	$0,-192(%rbp)
	jne	.Lj10219
	movq	-192(%rbp),%rax
	...

After (a second example appears later in the code, differing only by the offset):

.section .text.n_classes$_$twriter_$__$$_writeproperty$tpersistent$pointer,"ax"
	...
.Lj10217:
	cmpq	$0,-192(%rbp)
	jne	.Lj10219
	xorl	%eax,%eax
	...

In the cldrhelper unit, a reference once again is deterministic, only this time it permits the removal of a subsequent LEA instruction - before:

.section .text.n_cldrhelper$_$torderedcharacters_$__$$_insert$treorderunit$longint$$longint,"ax"
	...
	cmpl	$0,(%rcx)
	jne	.Lj1285
	movl	(%rcx),%eax
	leal	1(%eax),%edx
	call	CLDRHELPER$_$TORDEREDCHARACTERS_$__$$_ENSURESIZE$LONGINT
	...

After:

.section .text.n_cldrhelper$_$torderedcharacters_$__$$_insert$treorderunit$longint$$longint,"ax"
	...
	cmpl	$0,(%rcx)
	jne	.Lj1285
	movl	$1,%edx
	call	CLDRHELPER$_$TORDEREDCHARACTERS_$__$$_ENSURESIZE$LONGINT
	...

To showcase a simple case of a logical instruction being converted into a MOV, in the aasmcnst unit - before:

	...
.Lj303:
	movb	$1,%bl
	testw	$256,72(%rbp)
	je	.Lj305
	orb	$2,%bl
.Lj305:
	...

After - %bl is guaranteed to be equal to 1 when the OR instruction is reached, so its operand is merged with %bl's current value and converted to a MOV. Changing OR to MOV removes the dependency on %bl's previous value, potentially allowing a speed improvement even though it's nebulous in this case:

	...
.Lj303:
	movb	$1,%bl
	testw	$256,72(%rbp)
	je	.Lj305
	movb	$3,%bl
.Lj305:
	...

In the fpide unit, an arithmetic instruction gets converted into a MOV that then cascades down - before:

.section .text.n_fpide$_$tideapp_$__$$_memorysizes,"ax"
	...
	movl	$21,%eax
	movl	$21,72(%rsp)
	addl	$10,%eax
	movl	%eax,80(%rsp)
	...

After (in this case, there's no comparison to capitalise on... it does it independently):

.section .text.n_fpide$_$tideapp_$__$$_memorysizes,"ax"
	...
	movl	$21,72(%rsp)
	movl	$31,80(%rsp)
	...

In the gnutlssockets unit, a SUB gets optimised this time - before:

.section .text.n_gnutlssockets$_$tgnutlsx509certificate_$__$$_gensrvcert$hwdhavlhbo5k,"ax"
	...
.Lj213:
	movl	$8192,%eax
	movq	$8192,-112(%rbp)
	subq	$1,%rax
	movq	%rax,-136(%rbp)
	...

After (in thia case %eax is not used afterwards, so its MOV can be removed completely):

.section .text.n_gnutlssockets$_$tgnutlsx509certificate_$__$$_gensrvcert$hwdhavlhbo5k,"ax"
	...
.Lj213:
	movq	$8192,-112(%rbp)
	movq	$8191,-136(%rbp)
	...

In the assemble unit, what was originally an AND instruction that zeroes a single bit (converted to BTR by the post-peephole stage) - before:

.globl	ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT
ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT:
	...
	movl	$268435455,%esi
	cmpb	$0,U_$GLOBALS_$$_USEDEFFILEFOREXPORTS(%rip)
	je	.Lj932
	btrl	$10,%esi
.Lj932:

... it gets converted into a MOV, which is then optimised by the CMOVcc optimisations to remove the conditional branch - after:

.globl	ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT
ASSEMBLE$_$TINTERNALASSEMBLER_$__$$_MAKEOBJECT:
	...
	movl	$268435455,%esi
	movl	$268434431,%eax
	cmpb	$0,U_$GLOBALS_$$_USEDEFFILEFOREXPORTS(%rip)
	cmovnel	%eax,%esi

A similar example occurs in the bufdataset unit, albeit with an OR instruction - before:

.section .text.n_bufdataset$_$tdatapackethandler_$__$$_bytetorowstate$byte$$trowstate,"ax"
	.balign 16,0x90
.globl	BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE
BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE:
.Lc1303:
	xorl	%eax,%eax
	testb	$1,%dl
	je	.Lj2001
	orl	$1,%eax
.Lj2001:
	...

After (this whole sequence has a dependency chain length of only 2):

.section .text.n_bufdataset$_$tdatapackethandler_$__$$_bytetorowstate$byte$$trowstate,"ax"
	.balign 16,0x90
.globl	BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE
BUFDATASET$_$TDATAPACKETHANDLER_$__$$_BYTETOROWSTATE$BYTE$$TROWSTATE:
.Lc1303:
	xorl	%eax,%eax
	movl	$1,%r8d
	testb	$1,%dl
	cmovnel	%r8d,%eax
	...

In the cgobj unit, the new SETcc commit finds action outside of the initial CMP/JNE/MOV optimisation - before:

.section .text.n_cgobj_$$_asmsym2indsymflags$tasmsymbol$$tindsymflags,"ax"
	.balign 16,0x90
.globl	CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS
CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS:
.Lc635:
	xorb	%al,%al
	cmpb	$1,45(%rcx)
	je	.Lj1020
	orb	$1,%al
.Lj1020:
	cmpb	$5,44(%rcx)
	jne	.Lj1022
	orb	$2,%al
.Lj1022:
.Lc636:
	ret
.Lc634:

After:

.section .text.n_cgobj_$$_asmsym2indsymflags$tasmsymbol$$tindsymflags,"ax"
	.balign 16,0x90
.globl	CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS
CGOBJ_$$_ASMSYM2INDSYMFLAGS$TASMSYMBOL$$TINDSYMFLAGS:
.Lc635:
	cmpb	$1,45(%rcx)
	setneb	%al
	cmpb	$5,44(%rcx)
	jne	.Lj1022
	orb	$2,%al
.Lj1022:
.Lc636:
	ret
.Lc634:

In the fpwavreader unit, a deterministic MOV leads to a LEA and a NOT instruction being simplified in the same block - before:

.section .text.n_fpwavreader$_$twavreader_$__$$_loadfromstream$tstream$$boolean,"ax"
	...
	movl	$1,%edx
	leaq	-16(%rax,%rdx),%rax
	movl	$1,%ecx
	andq	$2,%rcx
	jne	.Lj50
	notq	%rdx
	andq	%rax,%rdx
	jmp	.Lj51
	.p2align 4,,10
	.p2align 3
.Lj50:
	...

After (a future merge request will extend this optimisation to simplify the xorl %ecx,%ecx/testq %rcx,%rcx pair, since %ecx is definitely zero and the conditional jump will definitely not branch - in fact, %ecx is not used after this point because the first instruction after .Lj51 is a write to %rcx that doesn't depend on its previous value: movq 16(%rbx),%rcx):

.section .text.n_fpwavreader$_$twavreader_$__$$_loadfromstream$tstream$$boolean,"ax"
	...
	movl	$1,%edx
	subq	$15,%rax
	xorl	%ecx,%ecx
	testq	%rcx,%rcx
	jne	.Lj50
	movq	$-2,%rdx
	andq	%rax,%rdx
	jmp	.Lj51
	.p2align 4,,10
	.p2align 3
.Lj50:
	...

In the mdt unit, there are a couple of instances where a reference is deterministic, but one is worth more attention - before:

.section .text.n_mdt_$$_mdtgtr$hgf19mgl3f$h,"ax"
	...
.Lj94:
	cmpl	%eax,-48(%rbp)
	jne	.Lj93
	movl	-48(%rbp),%eax
	movq	-32(%rbp),%rdx
	comisd	-8(%rdx,%rax,8),%xmm6
	...

After:

.section .text.n_mdt_$$_mdtgtr$hgf19mgl3f$h,"ax"
	...
.Lj94:
	cmpl	%eax,-48(%rbp)
	jne	.Lj93
	andl	%eax,%eax
	movq	-32(%rbp),%rdx
	comisd	-8(%rdx,%rax,8),%xmm6
	...

The compiler doesn't have enough information to determine if the upper bits of %rax are already zero, so the AND instruction cannot be safely omitted. This would require some ability for the compiler to make arbitrary notes about the states of registers.

Edited Feb 26, 2024 by J. Gareth "Kit" Moreton

[x86] Deterministic MOV optimisations

Summary

System

What is the current bug behavior?

What is the behavior after applying this patch?

Additional notes

Relevant logs and/or screenshots

Merge request reports