[x86] Improved shift/rotate code generation

Summary

This merge request adjusts the code generator for x86_64 so that, where possible, it uses the byte-sized CL register for variable shift and rotate counts instead of the full ECX/RCX register, producing more efficient code. Previously the full register was used because on i386 only EAX, EBX, ECX and EDX have byte-sized sub-registers; x86_64 lifts that restriction, but the code generator had never been adapted to take advantage of it.
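
For context, the kind of construct affected is a shift or rotate whose count is only known at run time. Below is a minimal, hypothetical Free Pascal example (not taken from the patch or its test sources) of the sort of routine that exercises this code path:

program shrdemo;
{ Hypothetical example, not part of the patch. }

function ShrVar(Value: LongWord; Count: Byte): LongWord;
begin
  { The count is not a compile-time constant, so it must be loaded into }
  { CL (previously the whole of ECX) before the SHR instruction.        }
  Result := Value shr Count;
end;

begin
  WriteLn(ShrVar($FF00, 8));
end.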

Additionally, a minor bug was fixed in the same section of code where ECX was deallocated one instruction too late, causing potential peephole optimisations to be missed.

On the peephole side, a bug fix has been applied to a Pass 2 optimisation involving shift/rotate instructions, and a new i386-only peephole optimisation attempts to shrink allocations of ECX down to just CL where the source supports it, so that i386 output matches x86_64 (testing the same optimisation on x86_64 shows no further improvement).

"Cascade optimisations" are uncommon, but by focusing the optimisation in the code generator, it takes strain off of the peephole optimizer.

System

  • Processor architecture: i386, x86-64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Code centred around variable shift and rotate instructions is now more efficient.

Relevant logs and/or screenshots

A lot of code shows improvement under x86_64-win64 at -O4, with most of the gain coming directly from the code generator rather than the peephole optimizer. In bzip2 - before:

.globl	BZIP2$_$TBZIP2_DECODE_STREAM_$__$$_GET_BITS$BYTE$$BYTE
	...
	movq	%rdx,%rcx
	subq	%r8,%rcx
# Peephole Optimization: movq $8,%r8 -> movl $8,%r8d (immediate can be represented with just 32 bits)
	movl	$8,%r8d
	subq	%rcx,%r8
# Peephole Optimization: MovAnd2Mov 1 done
	movl	%r8d,%ecx
	shrl	%cl,%eax
	movzbl	84(%rbx),%r9d
# Peephole Optimization: MovOpMov2MovOp (movq subq movl)
	movl	$8,%ecx
	subl	%edx,%ecx
# Peephole Optimization: AND %reg,%reg proven unnecessary after backward search (And2Nop)
	shrl	%cl,%r9d
	orl	%r9d,%eax
	movzbl	32(%rsp),%r8d
	movzbl	85(%rbx),%ecx
	movzbl	%sil,%edx
	subq	%rcx,%rdx
# Peephole Optimization: MovAnd2Mov 1 done
	movl	%edx,%ecx
	shll	%cl,%r8d
	...

After (note that the peephole optimizer does not kick in anywhere near as often):

.globl	BZIP2$_$TBZIP2_DECODE_STREAM_$__$$_GET_BITS$BYTE$$BYTE
	...
	movq	%rdx,%rcx
	subq	%r8,%rcx
# Peephole Optimization: movq $8,%r8 -> movl $8,%r8d (immediate can be represented with just 32 bits)
	movl	$8,%r8d
	subq	%rcx,%r8
	movb	%r8b,%cl
	shrl	%cl,%eax
	movzbl	84(%rbx),%r9d
# Peephole Optimization: MovOpMov2MovOp (movq subq movb)
	movb	$8,%cl
	subb	%dl,%cl
	shrl	%cl,%r9d
	orl	%r9d,%eax
	movzbl	32(%rsp),%r8d
	movzbl	85(%rbx),%ecx
	movzbl	%sil,%edx
	subq	%rcx,%rdx
	movb	%dl,%cl
	shll	%cl,%r8d
	...

In jwawintype - before:

.section .text.n_jwawintype_$$_int64shramod32$int64$longword$$int64,"ax"
	.balign 16,0x90
.globl	JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
	movq	%rcx,%rax
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %ecx = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: MovxOp2MovOp 1
	movb	%dl,%cl
	sarq	%cl,%rax
.Lc12:
	ret

After:

.section .text.n_jwawintype_$$_int64shramod32$int64$longword$$int64,"ax"
	.balign 16,0x90
.globl	JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
	movq	%rcx,%rax
# Peephole Optimization: MovxOp2Op 1
	movb	%dl,%cl
	sarq	%cl,%rax
.Lc12:
	ret
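
As the listing shows, the whole routine boils down to a single variable arithmetic shift. A hypothetical routine of the same shape (not the actual JwaWinType source) would be:

program sardemo;
{ Hypothetical routine resembling the listing above; not the real JwaWinType code. }

function SarVar(Value: Int64; ShiftCount: Byte): Int64;
begin
  { Only CL feeds the SARQ instruction, so the code generator now emits the }
  { MOVB directly instead of relying on a chain of peephole optimisations.  }
  Result := SarInt64(Value, ShiftCount);
end;

begin
  WriteLn(SarVar(-1024, 4));
end.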

For i386-win32, -O4, in the sfpux80 unit - before:

.globl	SFPUX80_$$_NORMALIZEFLOAT64SUBNORMAL$LONGWORD$LONGWORD$SMALLINT$LONGWORD$LONGWORD
	...
# Peephole Optimization: AndMovsxToAndMovzx
	movzbl	%cl,%ecx
	movl	%ebx,%eax
	shll	%cl,%eax

After - thanks to the register deallocation fix:

.globl	SFPUX80_$$_NORMALIZEFLOAT64SUBNORMAL$LONGWORD$LONGWORD$SMALLINT$LONGWORD$LONGWORD
	...
# Peephole Optimization: AndMovsxToAndMovzx
# Peephole Optimization: MovxOp2Op 3a
	movl	%ebx,%eax
	shll	%cl,%eax

Elsewhere, a common change - before:

	movl	%edx,%ecx
	shll	%cl,%eax

After:

# Peephole Optimization: Resized movmov to 8-bit to match shll instruction (MovOp2MovOp 1)
	movb	%dl,%cl
	shll	%cl,%eax
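
This pattern typically arises from an ordinary variable shift with a 32-bit count, roughly along the lines of the following hypothetical snippet (the exact registers depend on the calling convention):

program shldemo;
{ Hypothetical snippet showing the kind of source that produces the pattern above. }

function ShlVar(Value, Count: LongWord): LongWord;
begin
  { Only the low bits of the 32-bit count matter to SHL, so the 32-bit }
  { MOVL into ECX can be resized to a MOVB into CL.                    }
  Result := Value shl Count;
end;

begin
  WriteLn(ShlVar(1, 5));
end.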

Additional Notes

In some cases, such as in crc, performance seems to be hampered a bit under x86_64, but this appears to be down to inefficiencies in the register allocator rather than this change, since there is no obvious reason why the code needs to be worse - before:

.section .text.n_crc_$$_$shr$u128$byte$$u128,"ax"
	.balign 16,0x90
.globl	CRC_$$_$shr$U128$BYTE$$U128
CRC_$$_$shr$U128$BYTE$$U128:
.Lc27:
	movq	%rcx,%rax
	movzbl	%r8b,%r9d
# Peephole Optimization: movq $64,%rcx -> movl $64,%ecx (immediate can be represented with just 32 bits)
	movl	$64,%ecx
	subq	%r9,%rcx
	movq	8(%rdx),%r10
	shlq	%cl,%r10
	movq	(%rdx),%r11
	movq	%r9,%rcx
	shrq	%cl,%r11
	orq	%r11,%r10
	movq	%r10,(%rax)
# Peephole Optimization: MovxOp2MovOp 1
	movb	%r8b,%cl
	movq	8(%rdx),%rdx
	shrq	%cl,%rdx
	movq	%rdx,8(%rax)
.Lc28:
	ret

After:

.section .text.n_crc_$$_$shr$u128$byte$$u128,"ax"
	.balign 16,0x90
.globl	CRC_$$_$shr$U128$BYTE$$U128
CRC_$$_$shr$U128$BYTE$$U128:
.Lc27:
	movq	%rcx,%rax
	movzbl	%r8b,%r9d
# Peephole Optimization: movq $64,%r10 -> movl $64,%r10d (immediate can be represented with just 32 bits)
	movl	$64,%r10d
	subq	%r9,%r10
	movq	8(%rdx),%r11
	movb	%r10b,%cl ; <-- For some reason, the register allocator sees fit to use %r10 as an intermediate rather than using %rcx directly.
	shlq	%cl,%r11
	movq	(%rdx),%r10
	movb	%r9b,%cl
	shrq	%cl,%r10
	orq	%r10,%r11
	movq	%r11,(%rax)
	movzbl	%r8b,%r8d ; <-- "MovxOp2MovOp 1" doesn't get applied because of the instructions being rearranged below - this should be easily fixable though.
	movq	8(%rdx),%rdx
	movb	%r8b,%cl
	shrq	%cl,%rdx
	movq	%rdx,8(%rax)
.Lc28:
	ret

Maybe a bit more subtle in chmfiftimain - before:

.section .text.n_chmfiftimain_$$_writescalerootint$longword$longword$longint$$byte,"ax"
	...
.Lj29:
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
	leal	1(%r9),%ecx
# Peephole Optimization: AddMov2LeaAdd
	addl	$1,%r9d
# Peephole Optimization: MovAnd2Mov 1 done
# Peephole Optimization: MovMov2Mov 6 done
	movl	$1,%edi
# Peephole Optimization: %ebx = %r9d; changed to minimise pipeline stall (MovMov2Mov 6a}
# Peephole Optimization: MovAnd2Mov 1 done
	shll	%cl,%edi
	orl	%edi,%esi
	cmpl	%r9d,%r10d
	jnbe	.Lj29

After - though the instruction count is the same, parallelisation is slightly worse due to the loss of the LEA instruction; a peephole optimisation could possibly restore it:

.section .text.n_chmfiftimain_$$_writescalerootint$longword$longword$longint$$byte,"ax"
	...
.Lj29:
	addl	$1,%r9d
# Peephole Optimization: MovAnd2Mov 1 done
# Peephole Optimization: Mov2Nop 3b done
	movl	$1,%edi
# Peephole Optimization: %ebx = %r9d; changed to minimise pipeline stall (MovXXX2MovXXX)
	movb	%r9b,%cl
	shll	%cl,%edi
	orl	%edi,%esi
	cmpl	%r9d,%r10d
	jnbe	.Lj29

Most of the remaining problems seem to be due to the register allocator; the rest just require some small peephole optimisations, which would certainly be cheaper than the ones that no longer need to run.
