[x86-64] 32-bit to 64-bit zero extension optimisation

Summary

This merge request optimises an occasional code pattern where a pair of MOV instructions is used to zero-extend a 32-bit register into a 64-bit register, such as:

	movl	%esi,%eax
	movq	%rax,%r13

The peephole optimizer will now change this to:

	movl	%esi,%eax ; <-- This gets removed if %rax is not used afterwards
	movl	%esi,%r13d
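
This transformation is valid because writing to a 32-bit register on x86-64 implicitly zeroes bits 32 to 63 of the full 64-bit register, so the single movl into %r13d already performs the complete zero extension. A minimal C sketch of the same lowering (the function name is illustrative; the commented register choice assumes the SysV ABI):

	#include <stdint.h>

	/* A 32-bit to 64-bit zero extension needs no extra instruction on
	   x86-64: any 32-bit register write clears the upper 32 bits, so
	   compilers lower this cast to a single 32-bit move, e.g.
	   "movl %edi,%eax". */
	uint64_t zero_extend32(uint32_t x)
	{
		return (uint64_t)x;
	}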

If the source and target registers are the same, then an andl instruction is generated instead - for example:

	movl	%r12d,%eax
	movq	%rax,%r12

...becomes:

	movl	%r12d,%eax ; <-- This gets removed if %rax is not used afterwards
	andl	%r12d,%r12d
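
The andl acts as an in-place zero extension: ANDing a value with itself leaves the low 32 bits unchanged, while the 32-bit operation clears bits 32 to 63 as a side effect. (Presumably this substitution is only applied where the flags are not live, since andl writes them, unlike movl.) A small illustrative check of that reasoning in C:

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t r12 = 0xDEADBEEF12345678u;  /* dirty upper half */
		uint32_t lo  = (uint32_t)r12;
		assert((lo & lo) == lo);             /* AND is idempotent */
		r12 = lo;                            /* what andl %r12d,%r12d achieves */
		assert(r12 == 0x12345678u);
		return 0;
	}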

System

  • Processor architecture: x86-64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Code is made smaller and slightly faster where 32-bit to 64-bit zero extensions are concerned.

Relevant logs and/or screenshots

In fmtbcd (under -O4, x86_64-win64), before:

	...
.Lj1222:
	movb	%r13b,%dil
	movl	%esi,%eax
	movq	%rax,%r13
	addq	$1,%r13
	jno	.Lj1223
	call	FPC_OVERFLOW
.Lj1223:
	...

After:

	...
.Lj1222:
	movb	%r13b,%dil
	movl	%esi,%r13d
	addq	$1,%r13
	jno	.Lj1223
	call	FPC_OVERFLOW
.Lj1223:
	...

This is an interesting case for further optimisation because, if you study the code carefully, you'll note that jno .Lj1223 will always branch: %r13 lies between 0 and $FFFFFFFF before the addq $1,%r13 instruction, so the addition can never set the overflow flag.
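
That claim can be spot-checked in C (illustrative only): the largest zero-extended 32-bit value is $FFFFFFFF, and adding 1 to it cannot overflow a signed 64-bit register.

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		int64_t r13 = (int64_t)0xFFFFFFFFu;  /* worst-case zero-extended value */
		int64_t sum;
		/* GCC/Clang builtin: true if the signed addition overflows */
		assert(!__builtin_add_overflow(r13, 1, &sum));
		assert(sum == 0x100000000);
		return 0;
	}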

In jdmainct, before:

	...
.Lj25:
	leal	1(%r8d),%edx
	addl	$1,%r8d
	...
	movl	%r8d,%edx
	movq	%rdx,%rsi
	movq	(%r9,%rdx,8),%rdx
	...

After:

	...
.Lj25:
	leal	1(%r8d),%edx
	addl	$1,%r8d
	...
	movl	%r8d,%esi
	movq	(%r9,%rsi,8),%rdx
	...

Here, the peephole optimizer is not yet smart enough to know that the upper 32 bits of %r8 are equal to zero (addl, like every 32-bit operation, clears them), so it can't replace %rsi in the reference with %r8 without the help of !74.

The System unit demonstrates the same-register case - before:

.Lj2266:
	movl	%r12d,%eax
	movq	%rax,%r12

After:

.Lj2266:
	andl	%r12d,%r12d

The Sysutils unit contains a series of examples that showcase a longer-range version of the optimisation - before:

.section .text.n_sysutils$_$tmultireadexclusivewritesynchronizer_$__$$_threadidtoindex$longword$$longint,"ax"
	...
	movl	%edx,%eax
	movq	%rax,%rcx
	shrq	$12,%rcx
	xorl	%edx,%ecx
	movq	%rax,%rdx
	shrq	$32,%rdx
	xorl	%edx,%ecx
	movq	%rax,%rdx
	shrq	$36,%rdx
	...

After:

.section .text.n_sysutils$_$tmultireadexclusivewritesynchronizer_$__$$_threadidtoindex$longword$$longint,"ax"
	...
	movl	%edx,%eax
	movl	%edx,%ecx
	shrq	$12,%rcx
	xorl	%edx,%ecx
	andl	%edx,%edx
	shrq	$32,%rdx
	xorl	%edx,%ecx
	andl	%edx,%edx
	shrq	$36,%rdx
	...

This is another interesting case for future optimisation because the upper 32 bits of %rdx are known to be zero, so the SHR instructions with shift counts greater than or equal to 32 will always zero the register.
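
Again, an illustrative C check: once a register holds a zero-extended 32-bit value, a right shift by 32 or more bits must yield zero.

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t rdx = (uint32_t)0xFFFFFFFFu; /* zero-extended 32-bit value */
		assert((rdx >> 32) == 0);             /* shrq $32,%rdx always yields 0 */
		assert((rdx >> 36) == 0);             /* shrq $36,%rdx always yields 0 */
		return 0;
	}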
