[x86] Improved shift/rotate code generation
Summary
This merge request adjusts the x86_64 code generator so that, where possible, it loads only the byte-sized CL register with the count for variable shift and rotate instructions, producing more efficient code. Previously the full register was copied into ECX/RCX because on i386 only EAX, EBX, ECX and EDX have byte-sized sub-registers, so the count could not generally be moved as a byte; x86_64 has no such restriction, but the code generator was never adapted to take advantage of it. This merge request lifts that limitation.
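As a purely illustrative aside (the compiler and the units quoted below are Pascal; this C analogue and the register choices in its comments are assumptions of mine, not taken from the patch), the affected construct is a shift or rotate whose count is only known at run time, which the x86 variable-count instruction forms require to be in CL:

#include <stdint.h>

/* A variable shift: the SHL/SHR/ROL/ROR r/m,CL forms read their count
   from CL, so the code generator has to move the count there first. */
uint32_t shift_left(uint32_t value, uint32_t count)
{
    /* Assuming the count arrives in %edx:
         old x86_64 code:  movl %edx,%ecx   (full 32-bit copy)
         new x86_64 code:  movb %dl,%cl     (only the byte that matters)
       On i386 the full copy is still required whenever the count sits
       in a register without a byte form, e.g. %esi or %edi. */
    return value << (count & 31);
}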
Additionally, a minor bug was fixed in the same section of the code generator: ECX was deallocated one instruction too late, which caused some potential peephole optimisations to be missed.
Speaking of peephole optimisations, a bug fix has been applied to a Pass 2 optimisation involving shift/rotate instructions, and an i386-only peephole optimisation has been added that attempts to shrink the ECX allocation down to just CL when the source register permits it, so as to match x86_64 (testing it under x86_64 shows no further improvement).
Such "cascade optimisations" are uncommon, but doing the work in the code generator itself takes strain off of the peephole optimizer.
System
- Processor architecture: i386, x86-64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Code centred around variable shift and rotate instructions is now more efficient.
Relevant logs and/or screenshots
A lot of code shows improvement under x86_64-win64 at -O4, most of it coming straight from the code generator rather than from later peephole passes. In the bzip2 unit - before:
.globl BZIP2$_$TBZIP2_DECODE_STREAM_$__$$_GET_BITS$BYTE$$BYTE
...
movq %rdx,%rcx
subq %r8,%rcx
# Peephole Optimization: movq $8,%r8 -> movl $8,%r8d (immediate can be represented with just 32 bits)
movl $8,%r8d
subq %rcx,%r8
# Peephole Optimization: MovAnd2Mov 1 done
movl %r8d,%ecx
shrl %cl,%eax
movzbl 84(%rbx),%r9d
# Peephole Optimization: MovOpMov2MovOp (movq subq movl)
movl $8,%ecx
subl %edx,%ecx
# Peephole Optimization: AND %reg,%reg proven unnecessary after backward search (And2Nop)
shrl %cl,%r9d
orl %r9d,%eax
movzbl 32(%rsp),%r8d
movzbl 85(%rbx),%ecx
movzbl %sil,%edx
subq %rcx,%rdx
# Peephole Optimization: MovAnd2Mov 1 done
movl %edx,%ecx
shll %cl,%r8d
...
After (note that the peephole optimizer does not kick in anywhere near as often):
.globl BZIP2$_$TBZIP2_DECODE_STREAM_$__$$_GET_BITS$BYTE$$BYTE
...
movq %rdx,%rcx
subq %r8,%rcx
# Peephole Optimization: movq $8,%r8 -> movl $8,%r8d (immediate can be represented with just 32 bits)
movl $8,%r8d
subq %rcx,%r8
movb %r8b,%cl
shrl %cl,%eax
movzbl 84(%rbx),%r9d
# Peephole Optimization: MovOpMov2MovOp (movq subq movb)
movb $8,%cl
subb %dl,%cl
shrl %cl,%r9d
orl %r9d,%eax
movzbl 32(%rsp),%r8d
movzbl 85(%rbx),%ecx
movzbl %sil,%edx
subq %rcx,%rdx
movb %dl,%cl
shll %cl,%r8d
...
In jwawintype - before:
.section .text.n_jwawintype_$$_int64shramod32$int64$longword$$int64,"ax"
.balign 16,0x90
.globl JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
movq %rcx,%rax
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %ecx = %edx; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: MovxOp2MovOp 1
movb %dl,%cl
sarq %cl,%rax
.Lc12:
ret
After:
.section .text.n_jwawintype_$$_int64shramod32$int64$longword$$int64,"ax"
.balign 16,0x90
.globl JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINTYPE_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
movq %rcx,%rax
# Peephole Optimization: MovxOp2Op 1
movb %dl,%cl
sarq %cl,%rax
.Lc12:
ret
For i386-win32, -O4, in the sfpux80 unit - before:
.globl SFPUX80_$$_NORMALIZEFLOAT64SUBNORMAL$LONGWORD$LONGWORD$SMALLINT$LONGWORD$LONGWORD
...
# Peephole Optimization: AndMovsxToAndMovzx
movzbl %cl,%ecx
movl %ebx,%eax
shll %cl,%eax
After - thanks to the register deallocation fix:
.globl SFPUX80_$$_NORMALIZEFLOAT64SUBNORMAL$LONGWORD$LONGWORD$SMALLINT$LONGWORD$LONGWORD
...
# Peephole Optimization: AndMovsxToAndMovzx
# Peephole Optimization: MovxOp2Op 3a
movl %ebx,%eax
shll %cl,%eax
Elsewhere, a common change - before:
movl %edx,%ecx
shll %cl,%eax
After:
# Peephole Optimization: Resized movmov to 8-bit to match shll instruction (MovOp2MovOp 1)
movb %dl,%cl
shll %cl,%eax
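For illustration only, here is a rough C sketch of the condition that a resizing peephole like the i386-only one mentioned in the summary has to establish; the actual compiler is written in Pascal and uses its own instruction representation, so every type and helper name below is invented for this sketch:

#include <stdbool.h>

/* Invented instruction model -- not the FPC representation. */
typedef enum { REG_EAX, REG_EBX, REG_ECX, REG_EDX,
               REG_ESI, REG_EDI, REG_EBP, REG_ESP } Reg;

typedef enum { OP_MOV,
               OP_SHL, OP_SHR, OP_SAL, OP_SAR,   /* shifts  */
               OP_ROL, OP_ROR, OP_RCL, OP_RCR,   /* rotates */
               OP_OTHER } Opcode;

typedef struct {
    Opcode op;
    Reg    src;   /* for MOV: source register      */
    Reg    dst;   /* for MOV: destination register */
} Insn;

/* On i386 only EAX, EBX, ECX and EDX have an addressable low byte
   (AL/BL/CL/DL); on x86_64 every general-purpose register would do. */
bool has_byte_form_i386(Reg r)
{
    return r == REG_EAX || r == REG_EBX || r == REG_ECX || r == REG_EDX;
}

/* Can "movl %src,%ecx" followed by a shift/rotate counted by %cl be
   shrunk to "movb %srcb,%cl"?  Relies on the shift/rotate opcodes
   being contiguous in the enum above.  A real pass additionally has
   to prove that nothing downstream reads more of ECX than CL. */
bool can_resize_mov_to_byte(const Insn *mov, const Insn *next)
{
    return mov->op == OP_MOV
        && mov->dst == REG_ECX                        /* feeds the count     */
        && has_byte_form_i386(mov->src)               /* byte move encodable */
        && next->op >= OP_SHL && next->op <= OP_RCR;  /* count consumer      */
}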
Additional Notes
In some cases, like in the crc unit, performance seems to be hampered a bit under x86_64, but this appears to be more down to inefficiencies in the register allocator than to this change, since there is no obvious reason why the code has to be worse - before:
.section .text.n_crc_$$_$shr$u128$byte$$u128,"ax"
.balign 16,0x90
.globl CRC_$$_$shr$U128$BYTE$$U128
CRC_$$_$shr$U128$BYTE$$U128:
.Lc27:
movq %rcx,%rax
movzbl %r8b,%r9d
# Peephole Optimization: movq $64,%rcx -> movl $64,%ecx (immediate can be represented with just 32 bits)
movl $64,%ecx
subq %r9,%rcx
movq 8(%rdx),%r10
shlq %cl,%r10
movq (%rdx),%r11
movq %r9,%rcx
shrq %cl,%r11
orq %r11,%r10
movq %r10,(%rax)
# Peephole Optimization: MovxOp2MovOp 1
movb %r8b,%cl
movq 8(%rdx),%rdx
shrq %cl,%rdx
movq %rdx,8(%rax)
.Lc28:
ret
After:
.section .text.n_crc_$$_$shr$u128$byte$$u128,"ax"
.balign 16,0x90
.globl CRC_$$_$shr$U128$BYTE$$U128
CRC_$$_$shr$U128$BYTE$$U128:
.Lc27:
movq %rcx,%rax
movzbl %r8b,%r9d
# Peephole Optimization: movq $64,%r10 -> movl $64,%r10d (immediate can be represented with just 32 bits)
movl $64,%r10d
subq %r9,%r10
movq 8(%rdx),%r11
movb %r10b,%cl ; <-- For some reason, the register allocator sees fit to use %r10 as an intermediate rather than using %rcx directly.
shlq %cl,%r11
movq (%rdx),%r10
movb %r9b,%cl
shrq %cl,%r10
orq %r10,%r11
movq %r11,(%rax)
movzbl %r8b,%r8d; <-- "MovxOp2MovOp 1" doesn't get applied because of the instructions being rearranged below - this should be easily fixable though.
movq 8(%rdx),%rdx
movb %r8b,%cl
shrq %cl,%rdx
movq %rdx,8(%rax)
.Lc28:
ret
The difference is a bit more subtle in chmfiftimain - before:
.section .text.n_chmfiftimain_$$_writescalerootint$longword$longword$longint$$byte,"ax"
...
.Lj29:
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal 1(%r9),%ecx
# Peephole Optimization: AddMov2LeaAdd
addl $1,%r9d
# Peephole Optimization: MovAnd2Mov 1 done
# Peephole Optimization: MovMov2Mov 6 done
movl $1,%edi
# Peephole Optimization: %ebx = %r9d; changed to minimise pipeline stall (MovMov2Mov 6a)
# Peephole Optimization: MovAnd2Mov 1 done
shll %cl,%edi
orl %edi,%esi
cmpl %r9d,%r10d
jnbe .Lj29
After - though the instruction count is the same, the parallelisation is slightly worse: the LEA previously computed the shift count into ECX independently of the ADD, whereas the MOVB now has to wait for the ADD's result. A peephole optimisation can possibly put the LEA back:
.section .text.n_chmfiftimain_$$_writescalerootint$longword$longword$longint$$byte,"ax"
...
.Lj29:
addl $1,%r9d
# Peephole Optimization: MovAnd2Mov 1 done
# Peephole Optimization: Mov2Nop 3b done
movl $1,%edi
# Peephole Optimization: %ebx = %r9d; changed to minimise pipeline stall (MovXXX2MovXXX)
movb %r9b,%cl
shll %cl,%edi
orl %edi,%esi
cmpl %r9d,%r10d
jnbe .Lj29
Most of the problems seem to be due to the register allocator; the others just require some small new peephole optimisations, which are certainly less expensive than the ones that no longer need to execute.