x86: Massive MOVS/ZX overhaul
Summary
This merge request puts together a number of optimisations for MOVZX and MOVSX/D instructions on i386 and x86_64 in an attempt to reduce instruction count and code size. It consists of a number of additions split into separate commits that can be cherry-picked if necessary (I've tried to squash bug fixes into the relevant commits, although I can't guarantee total success unless all of the commits are included). A summary of the changes:
- New optimisations for MOVZX/op combinations to reduce instruction size and minimise pipeline stalls.
- The OptPass2Movx routine now handles signed operations better and tries to favour smaller operands where possible.
- "movzbl ###.%ecx; shl %cl, ###" and other shift and rotate operations now have an optimisation to write to just %cl if it's deallocated afterwards (a "mov" that doesn't depend on %ecx in between the two instructions is also allowed, as this encapsulates the common sequence "1 shl x").
- A new MovxAndTest2Test optimisation, mirroring the existing MovAndTest2Test optimisation, that reduces the instruction count when testing a bit in a memory location.
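The %cl narrowing for shifts and rotates relies on the fact that x86 variable shifts read only %cl (with the count masked to the operand width), so any stale upper bits of %ecx are harmless. A minimal Python sketch of that reasoning (the shl32 model below is mine, not compiler code):

```python
# Model of a 32-bit "shll %cl,dst": the hardware reads only the low
# byte of the count register and masks it to 5 bits.
def shl32(value, ecx):
    return (value << ((ecx & 0xFF) & 31)) & 0xFFFFFFFF

# "movzbl src,%ecx" defines all of %ecx; "movb src,%cl" leaves stale
# upper bits behind - the shift result is the same either way, so the
# narrower mov is safe when %ecx is deallocated afterwards.
for src in range(256):
    for stale in (0x00000000, 0xDEADBE00, 0xFFFFFF00):
        assert shl32(1, src) == shl32(1, stale | src)
```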
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Improved code generation under i386 and x86_64 where zero and sign extension are concerned.
Relevant logs and/or screenshots
Example of operand size reduction in the System unit - before:
.section .text.n_system_$$_upcase$char$$char,"ax"
.balign 16,0x90
.globl SYSTEM_$$_UPCASE$CHAR$$CHAR
SYSTEM_$$_UPCASE$CHAR$$CHAR:
movzbl %cl,%eax
subl $97,%eax
cmpl $26,%eax
...
After:
.section .text.n_system_$$_upcase$char$$char,"ax"
.balign 16,0x90
.globl SYSTEM_$$_UPCASE$CHAR$$CHAR
SYSTEM_$$_UPCASE$CHAR$$CHAR:
movb %cl,%al
subb $97,%al
cmpb $26,%al
...
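The narrowed UpCase sequence branches identically to the original for every input byte. A quick sketch (assuming the elided compare feeds an unsigned below/above branch, which the truncated listing doesn't show):

```python
# 32-bit form: movzbl %cl,%eax ; subl $97,%eax ; cmpl $26,%eax
def in_range_32(cl):
    eax = cl & 0xFF                  # zero-extend the input byte
    eax = (eax - 97) & 0xFFFFFFFF    # wraps to a huge value if cl < 97
    return eax < 26                  # unsigned "below" compare

# 8-bit form: movb %cl,%al ; subb $97,%al ; cmpb $26,%al
def in_range_8(cl):
    al = (cl - 97) & 0xFF            # wraps modulo 256 instead
    return al < 26

# Both forms recognise exactly the lowercase letters 'a'..'z'.
for cl in range(256):
    assert in_range_32(cl) == in_range_8(cl) == (97 <= cl <= 122)
```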
MovxAndTest2Test in action in the System unit - before:
.Lj1787:
cmpb $5,%sil
jb .Lj1587
seteb %cl
andb %bl,%cl
je .Lj1815
testw %r9w,%r9w
je .Lj1805
movswq %r9w,%rcx
movzbw -1(%rdx,%rcx,1),%cx
andw $1,%cx
jne .Lj1815
...
After:
.Lj1787:
cmpb $5,%sil
jb .Lj1587
seteb %cl
andb %bl,%cl
je .Lj1815
testw %r9w,%r9w
je .Lj1805
movswq %r9w,%rcx
testb $1,-1(%rdx,%rcx,1)
jne .Lj1815
...
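The ZF equivalence behind MovxAndTest2Test can be sketched as follows (helper names are mine; the rewrite also requires that %cx not be needed after the branch, which holds here):

```python
# Before: movzbw mem,%cx ; andw $1,%cx ; jne  - ZF comes from cx & 1
def jne_before(byte):
    cx = byte & 0xFF    # movzbw zero-extends the byte into %cx
    cx &= 1             # andw $1,%cx sets ZF from the result
    return cx != 0      # jne taken when ZF is clear

# After: testb $1,mem ; jne  - ZF comes from the same bit, no load
def jne_after(byte):
    return (byte & 1) != 0

for byte in range(256):
    assert jne_before(byte) == jne_after(byte)
```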
Better handling of signed operands in OptPass2Movx permits the removal of movslq instructions in the System unit:
.Lj4346:
movzbl (%r15,%r14,1),%eax
andl $31,%eax
shll $6,%eax
movslq %eax,%rax ; <-- Gets removed because the most significant bit is always zero.
movq %rax,-40(%rbp)
movzbl 1(%r15,%r14,1),%eax
andl $63,%eax
orq %rax,-40(%rbp)
cmpq $127,-40(%rbp)
jnbe .Lj4344
movq $63,-40(%rbp)
jmp .Lj4344
.balign 16,0x90
.Lj4347:
movzbl (%r15,%r14,1),%eax
andl $15,%eax
shll $12,%eax
movslq %eax,%rax
movq %rax,-40(%rbp)
movzbl 1(%r15,%r14,1),%eax
andl $63,%eax
shll $6,%eax
movslq %eax,%rax ; <-- Gets removed because the most significant bit is always zero.
orq %rax,-40(%rbp)
movzbl 2(%r15,%r14,1),%eax
andl $63,%eax
orq %rax,-40(%rbp)
...
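The removal is sound because the preceding and/shl pair bounds the value: 31 shl 6 = 1984 and 63 shl 6 = 4032, so bit 31 is always clear and sign extension is the identity. A sketch with a hand-rolled movslq model:

```python
# Model of "movslq %eax,%rax": copy bit 31 into the upper 32 bits.
def movslq(eax):
    return eax | 0xFFFFFFFF00000000 if eax & 0x80000000 else eax

for byte in range(256):
    # andl $31,%eax ; shll $6,%eax  -> at most 1984, bit 31 clear
    assert movslq((byte & 31) << 6) == (byte & 31) << 6
    # andl $63,%eax ; shll $6,%eax  -> at most 4032, bit 31 clear
    assert movslq((byte & 63) << 6) == (byte & 63) << 6
```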
Similar to the removal of movslq instructions, this pair of movslq instructions is converted into regular movl instructions on account of the super-registers being different - before:
...
movq %rcx,%rbx
movq %r8,%rsi
movq %r9,%rdi
movzbl (%r8),%r12d
movzbl (%r9),%r13d
movslq %r12d,%rax
movslq %r13d,%rcx
addq %rcx,%rax
cmpq %rdx,%rax
jng .Lj407
...
After:
...
movq %rcx,%rbx
movq %r8,%rsi
movq %r9,%rdi
movzbl (%r8),%r12d
movzbl (%r9),%r13d
movl %r12d,%eax
movl %r13d,%ecx
addq %rcx,%rax
cmpq %rdx,%rax
jng .Lj407
...
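This works because a 32-bit mov implicitly zeroes the upper half of the destination register, and after a movzbl the value fits in 8 bits, so sign and zero extension coincide - a sketch:

```python
def movslq(v32):
    # sign-extend a 32-bit value to 64 bits
    return v32 | 0xFFFFFFFF00000000 if v32 & 0x80000000 else v32

def movl(v32):
    # a 32-bit register write zeroes the upper half of the 64-bit reg
    return v32 & 0xFFFFFFFF

# After movzbl the register holds 0..255, so both forms agree...
for byte in range(256):
    assert movslq(byte) == movl(byte)

# ...while they would differ for a value with bit 31 set.
assert movslq(0x80000000) != movl(0x80000000)
```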
There is also potential to optimise this further with additional work (%ecx isn't used afterwards):
...
movq %rcx,%rbx
movq %r8,%rsi
movq %r9,%rdi
movzbl (%r8),%r12d
movzbl (%r9),%r13d
leaq (%r13d,%r12d),%rax
cmpq %rdx,%rax
jng .Lj407
...
Some changes increase the instruction count, but they also remove false dependencies - for example, under i386-win32 in the System unit. Here, only %ax is used as part of the actual parameter, with the upper 16 bits being undefined; as such, "andl $1023,%eax" has a false dependency on that undefined region (which is partly down to the old OptPass2Movx code). It is now replaced with "andw $1023,%ax; movzwl %ax,%eax", which then gets optimised to "andw $1023,%ax; cwtl", since the most significant bit of %ax is zero after the "and" ("cwtl" is identical to "movswl %ax,%eax" but assembles into a single machine code byte... $98) - before:
.section .text.n_system_$$_makelangid$word$word$$word,"ax"
.balign 16,0x90
.globl SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD
SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD:
andl $1023,%eax
movzwl %dx,%edx
shll $10,%edx
orl %edx,%eax
ret
After:
.section .text.n_system_$$_makelangid$word$word$$word,"ax"
.balign 16,0x90
.globl SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD
SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD:
andw $1023,%ax
cwtl
movzwl %dx,%edx
shll $10,%edx
orl %edx,%eax
ret
A nice side-effect here is that "andl $1023,%eax" takes 6 bytes to encode, while "andw $1023,%ax; cwtl" only takes 4 + 1 = 5 bytes.
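The movzwl-to-cwtl substitution can be sanity-checked the same way: after "andw $1023,%ax" the sign bit of %ax is clear, so sign and zero extension of %ax agree (the model functions below are mine):

```python
def movzwl(ax):
    # zero-extend %ax into %eax
    return ax & 0xFFFF

def cwtl(ax):
    # sign-extend %ax into %eax (identical to "movswl %ax,%eax")
    ax &= 0xFFFF
    return ax | 0xFFFF0000 if ax & 0x8000 else ax

for word in range(0x10000):
    ax = word & 1023          # andw $1023,%ax clears bits 10..15
    assert movzwl(ax) == cwtl(ax)
```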