[x86_64] OptPass1MOV additions to allow more 32-bit to 64-bit zero-extensions

Summary

This merge request extends the MovMov2Mov family of peephole optimisations, which remove an intermediate register, to also permit cases where a value is stored in a 32-bit intermediate register but its 64-bit counterpart is then written to the final destination. Under x86_64 rules, any write to a 32-bit register automatically zeroes the upper 32 bits of the full 64-bit register.
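
As a quick illustration of this rule (a minimal sketch, not taken from the patch itself):

	movq	$-1,%rax	# %rax = 0xFFFFFFFFFFFFFFFF
	movl	$1,%eax 	# %rax = 0x0000000000000001 (upper 32 bits zeroed)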

When dealing with movl x,%regd; movq %regq,y, the following scenarios are valid optimisations:

  • x is a 32-bit immediate between 0 and 2^31 - 1, in which case the optimised instruction is movq x,y. Negative immediates are not valid here because movq sign-extends its 32-bit immediate, which would set the upper 32 bits rather than zero them.
  • y is another 64-bit register, in which case the optimised instruction is movl x,yd, where yd is the 32-bit counterpart of y; the write to yd is implicitly zero-extended. Negative immediates are valid in this case since no sign-extension takes place, and because the entire domain of x is preserved, x may equally be a register or a reference. (See the sketch after this list.)
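
A schematic sketch of both cases (register names and offsets are chosen arbitrarily for illustration):

# Case 1: x is a non-negative 32-bit immediate, y is a reference
	movl	$123,%eax		# before
	movq	%rax,16(%rsp)
	movq	$123,16(%rsp)		# after

# Negative immediates are excluded from case 1 because movq sign-extends its imm32:
	movl	$-1,%eax		# %rax = 0x00000000FFFFFFFF
	movq	%rax,16(%rsp)		# stores 0x00000000FFFFFFFF
	movq	$-1,16(%rsp)		# would store 0xFFFFFFFFFFFFFFFF - not equivalent

# Case 2: y is a 64-bit register
	movl	(%r9),%eax		# before
	movq	%rax,%r12
	movl	(%r9),%r12d		# after (the write to %r12d zero-extends into %r12)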

System

  • Processor architecture: x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Extra peephole optimisations are now performed under x86_64 when a register is zero-extended from 32-bit to 64-bit.

Relevant logs and/or screenshots

A large number of compiled files receive improvements under x86_64-win64 at -O4, and likely at lower optimisation settings too.

In cgobj, a simple example that appears in many source files - before:

.Lj443:
	...
	movl	(%r9),%eax
	movq	%rax,%r12

After:

.Lj443:
	...
	movl	(%r9),%r12d

In bzip2stream, an immediate is zero-extended - before:

.section .text.n_bzip2stream$_$tdecompressbzip2stream_$__$$_receive_mtf_values,"ax"
	...
	xorl	%eax,%eax
	movq	%rax,32(%rsp)

After:

.section .text.n_bzip2stream$_$tdecompressbzip2stream_$__$$_receive_mtf_values,"ax"
	...
	movq	$0,32(%rsp)
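
The xorl %eax,%eax idiom zeroes all of %rax (again by the 32-bit write rule), which is what makes the single immediate store equivalent; schematically:

	xorl	%eax,%eax		# %rax = 0
	movq	%rax,32(%rsp)		# stores a 64-bit zero
	movq	$0,32(%rsp)		# equivalent single instruction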

Examples with other constants appear in uregexpr - before:

.Lj2207:
	movl	$2147483647,%eax
	movq	%rax,128(%rsp)
	...
.Lj2208:
	xorl	%eax,%eax
	movq	%rax,112(%rsp)
	...
.Lj2212:
	movl	$1,%eax
	movq	%rax,112(%rsp)
	...
.Lj2248:
	xorl	%eax,%eax
	movq	%rax,112(%rsp)
	movl	$2147483647,%eax
	movq	%rax,128(%rsp)
	...
.Lj2249:
	movl	$1,%eax
	movq	%rax,112(%rsp)
	movl	$2147483647,%eax
	movq	%rax,128(%rsp)

After:

.Lj2207:
	movq	$2147483647,128(%rsp)
	...
.Lj2208:
	movq	$0,112(%rsp)
	...
.Lj2212:
	movq	$1,112(%rsp)
	...
.Lj2248:
	movq	$0,112(%rsp)
	movq	$2147483647,128(%rsp)
	...
.Lj2249:
	movq	$1,112(%rsp)
	movq	$2147483647,128(%rsp)

In jquant2, a heap of zeroes written to the stack is simplified - before:

.section .text.n_jquant2_$$_compute_color$j_decompress_ptr$box$longint,"ax"
	...
	xorl	%eax,%eax
	movq	%rax,(%rsp)
	movq	%rax,32(%rsp)
	xorl	%eax,%eax
	movq	%rax,24(%rsp)
	movq	%rax,8(%rsp)

After:

.section .text.n_jquant2_$$_compute_color$j_decompress_ptr$box$longint,"ax"
	...
	movq	$0,(%rsp)
	movq	$0,32(%rsp)
	movq	$0,24(%rsp)
	movq	$0,8(%rsp)

In rax86, the following sequence appears 27 times (sometimes with a different address or branch destination) - before:

	...
	movq	-1160(%rbp),%rdx
	movzbl	63(%rdx),%ebx
	testl	%ebx,%ebx
	jle	.Lj211
	xorl	%eax,%eax
	movq	%rax,-1184(%rbp)
	...

After:

	...
	movq	-1160(%rbp),%rdx
	movzbl	63(%rdx),%ebx
	testl	%ebx,%ebx
	jle	.Lj211
	movq	$0,-1184(%rbp)
	...

In mdt, the new optimisation combines with an otherwise unrelated AddMov2LeaAdd, which the optimiser has now determined is feasible and which breaks the dependency chain to reduce the cycle count - before:

.Lj224:
	movl	%esi,%edi
	addl	$1,%esi
	movl	$1,%eax
	movq	%rax,-104(%rbp)
	movq	-32(%rbp),%rax
	movl	%esi,%edx
	movsd	-8(%rax,%rdx,8),%xmm9

After (it is theoretically possible to improve this assembly even further since, for example, the upper 32 bits of %rsi are zero; see the sketch below):

.Lj224:
	movl	%esi,%edi
	leal	1(%esi),%edx
	addl	$1,%esi
	movq	$1,-104(%rbp)
	movq	-32(%rbp),%rax
	movsd	-8(%rax,%rdx,8),%xmm9
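
For instance, a hypothetical further pass could drop the leal and index with %rsi directly (assuming %rdx is not needed afterwards), since addl $1,%esi zero-extends into %rsi and leaves %esi equal to the value leal placed in %edx:

.Lj224:
	movl	%esi,%edi
	addl	$1,%esi 		# zero-extends into %rsi
	movq	$1,-104(%rbp)
	movq	-32(%rbp),%rax
	movsd	-8(%rax,%rsi,8),%xmm9	# %rsi now equals what leal computed in %edx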