x86: Massive MOVS/ZX overhaul
Summary
This merge request puts together a number of optimisations for MOVZX and MOVSX/D instructions on i386 and x86_64 in an attempt to reduce instruction count and code size. It consists of a number of additions split into separate commits that can be cherry-picked if necessary (I've tried to squash bug fixes into the relevant commits, although I can't guarantee total success unless all of the commits are included). A summary of the changes:
- New optimisations for MOVZX/op combinations to reduce instruction size and minimise pipeline stalls.
- The OptPass2Movx routine now handles signed operations better and tries to favour smaller operands where possible.
- "movzbl ###.%ecx; shl %cl, ###" and other shift and rotate operations now have an optimisation to write to just %cl if it's deallocated afterwards (a "mov" that doesn't depend on %ecx in between the two instructions is also allowed, as this encapsulates the common sequence "1 shl x").
- A new MovxAndTest2Test optimisation, mirroring the existing MovAndTest2Test optimisation, that reduces the instruction count when testing a bit in a memory location.
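The %cl narrowing for shifts and rotates relies on the fact that x86 variable shifts read only %cl (with the count masked to the operand width), so any stale upper bits of %ecx are harmless. A minimal Python sketch of that reasoning (the shl32 model below is mine, not compiler code):

```python
# Model of a 32-bit "shll %cl,dst": the hardware reads only the low
# byte of the count register and masks it to 5 bits.
def shl32(value, ecx):
    return (value << ((ecx & 0xFF) & 31)) & 0xFFFFFFFF

# "movzbl src,%ecx" defines all of %ecx; "movb src,%cl" leaves stale
# upper bits behind - the shift result is the same either way, so the
# narrower mov is safe when %ecx is deallocated afterwards.
for src in range(256):
    for stale in (0x00000000, 0xDEADBE00, 0xFFFFFF00):
        assert shl32(1, src) == shl32(1, stale | src)
```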
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Improved code generation under i386 and x86_64 where zero and sign extension are concerned.
Relevant logs and/or screenshots
Example of operand size reduction in the System unit - before:
.section .text.n_system_$$_upcase$char$$char,"ax"
.balign 16,0x90
.globl SYSTEM_$$_UPCASE$CHAR$$CHAR
SYSTEM_$$_UPCASE$CHAR$$CHAR:
movzbl %cl,%eax
subl $97,%eax
cmpl $26,%eax
...
After:
.section .text.n_system_$$_upcase$char$$char,"ax"
.balign 16,0x90
.globl SYSTEM_$$_UPCASE$CHAR$$CHAR
SYSTEM_$$_UPCASE$CHAR$$CHAR:
movb %cl,%al
subb $97,%al
cmpb $26,%al
...
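The narrowed UpCase sequence branches identically to the original for every input byte. A quick sketch (assuming the elided compare feeds an unsigned below/above branch, which the truncated listing doesn't show):

```python
# 32-bit form: movzbl %cl,%eax ; subl $97,%eax ; cmpl $26,%eax
def in_range_32(cl):
    eax = cl & 0xFF                  # zero-extend the input byte
    eax = (eax - 97) & 0xFFFFFFFF    # wraps to a huge value if cl < 97
    return eax < 26                  # unsigned "below" compare

# 8-bit form: movb %cl,%al ; subb $97,%al ; cmpb $26,%al
def in_range_8(cl):
    al = (cl - 97) & 0xFF            # wraps modulo 256 instead
    return al < 26

# Both forms recognise exactly the lowercase letters 'a'..'z'.
for cl in range(256):
    assert in_range_32(cl) == in_range_8(cl) == (97 <= cl <= 122)
```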
MovxAndTest2Test in action in the System unit - before:
.Lj1787:
cmpb $5,%sil
jb .Lj1587
seteb %cl
andb %bl,%cl
je .Lj1815
testw %r9w,%r9w
je .Lj1805
movswq %r9w,%rcx
movzbw -1(%rdx,%rcx,1),%cx
andw $1,%cx
jne .Lj1815
...
After:
.Lj1787:
cmpb $5,%sil
jb .Lj1587
seteb %cl
andb %bl,%cl
je .Lj1815
testw %r9w,%r9w
je .Lj1805
movswq %r9w,%rcx
testb $1,-1(%rdx,%rcx,1)
jne .Lj1815
...
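The ZF equivalence behind MovxAndTest2Test can be sketched as follows (helper names are mine; the rewrite also requires that %cx not be needed after the branch, which holds here):

```python
# Before: movzbw mem,%cx ; andw $1,%cx ; jne  - ZF comes from cx & 1
def jne_before(byte):
    cx = byte & 0xFF    # movzbw zero-extends the byte into %cx
    cx &= 1             # andw $1,%cx sets ZF from the result
    return cx != 0      # jne taken when ZF is clear

# After: testb $1,mem ; jne  - ZF comes from the same bit, no load
def jne_after(byte):
    return (byte & 1) != 0

for byte in range(256):
    assert jne_before(byte) == jne_after(byte)
```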
Better handling of signed operands in OptPass2Movx permits the removal of movslq instructions in the System unit:
.Lj4346:
movzbl (%r15,%r14,1),%eax
andl $31,%eax
shll $6,%eax
movslq %eax,%rax ; <-- Gets removed because the most significant bit is always zero.
movq %rax,-40(%rbp)
movzbl 1(%r15,%r14,1),%eax
andl $63,%eax
orq %rax,-40(%rbp)
cmpq $127,-40(%rbp)
jnbe .Lj4344
movq $63,-40(%rbp)
jmp .Lj4344
.balign 16,0x90
.Lj4347:
movzbl (%r15,%r14,1),%eax
andl $15,%eax
shll $12,%eax
movslq %eax,%rax
movq %rax,-40(%rbp)
movzbl 1(%r15,%r14,1),%eax
andl $63,%eax
shll $6,%eax
movslq %eax,%rax ; <-- Gets removed because the most significant bit is always zero.
orq %rax,-40(%rbp)
movzbl 2(%r15,%r14,1),%eax
andl $63,%eax
orq %rax,-40(%rbp)
...
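The removal is sound because the preceding and/shl pair bounds the value: 31 shl 6 = 1984 and 63 shl 6 = 4032, so bit 31 is always clear and sign extension is the identity. A sketch with a hand-rolled movslq model:

```python
# Model of "movslq %eax,%rax": copy bit 31 into the upper 32 bits.
def movslq(eax):
    return eax | 0xFFFFFFFF00000000 if eax & 0x80000000 else eax

for byte in range(256):
    # andl $31,%eax ; shll $6,%eax  -> at most 1984, bit 31 clear
    assert movslq((byte & 31) << 6) == (byte & 31) << 6
    # andl $63,%eax ; shll $6,%eax  -> at most 4032, bit 31 clear
    assert movslq((byte & 63) << 6) == (byte & 63) << 6
```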
Similar to the removal of movslq instructions, this pair of movslq instructions is converted into regular movl instructions on account of the super-registers being different - before:
...
movq %rcx,%rbx
movq %r8,%rsi
movq %r9,%rdi
movzbl (%r8),%r12d
movzbl (%r9),%r13d
movslq %r12d,%rax
movslq %r13d,%rcx
addq %rcx,%rax
cmpq %rdx,%rax
jng .Lj407
...
After:
...
movq %rcx,%rbx
movq %r8,%rsi
movq %r9,%rdi
movzbl (%r8),%r12d
movzbl (%r9),%r13d
movl %r12d,%eax
movl %r13d,%ecx
addq %rcx,%rax
cmpq %rdx,%rax
jng .Lj407
...
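This works because a 32-bit mov implicitly zeroes the upper half of the destination register, and after a movzbl the value fits in 8 bits, so sign and zero extension coincide - a sketch:

```python
def movslq(v32):
    # sign-extend a 32-bit value to 64 bits
    return v32 | 0xFFFFFFFF00000000 if v32 & 0x80000000 else v32

def movl(v32):
    # a 32-bit register write zeroes the upper half of the 64-bit reg
    return v32 & 0xFFFFFFFF

# After movzbl the register holds 0..255, so both forms agree...
for byte in range(256):
    assert movslq(byte) == movl(byte)

# ...while they would differ for a value with bit 31 set.
assert movslq(0x80000000) != movl(0x80000000)
```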
There is also potential to optimise this further with additional work (%ecx isn't used afterwards):
...
movq %rcx,%rbx
movq %r8,%rsi
movq %r9,%rdi
movzbl (%r8),%r12d
movzbl (%r9),%r13d
leaq (%r13d,%r12d),%rax
cmpq %rdx,%rax
jng .Lj407
...
Some changes increase the instruction count, but they also remove false dependencies - for example, under i386-win32 in the System unit. Here, only %ax is used as part of the actual parameter, with the upper 16 bits being undefined; as such, "andl $1023,%eax" has a false dependency on that undefined region (which is partly down to the old OptPass2Movx code). It is now replaced with "andw $1023,%ax; movzwl %ax,%eax", which then gets optimised to "andw $1023,%ax; cwtl", since the most significant bit of %ax is zero after the "and" ("cwtl" is identical to "movswl %ax,%eax" but assembles into a single machine code byte... $98) - before:
.section .text.n_system_$$_makelangid$word$word$$word,"ax"
.balign 16,0x90
.globl SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD
SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD:
andl $1023,%eax
movzwl %dx,%edx
shll $10,%edx
orl %edx,%eax
ret
After:
.section .text.n_system_$$_makelangid$word$word$$word,"ax"
.balign 16,0x90
.globl SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD
SYSTEM_$$_MAKELANGID$WORD$WORD$$WORD:
andw $1023,%ax
cwtl
movzwl %dx,%edx
shll $10,%edx
orl %edx,%eax
ret
A nice side-effect here is that "andl $1023,%eax" takes 6 bytes to encode, while "andw $1023,%ax; cwtl" only takes 4 + 1 = 5 bytes.
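The movzwl-to-cwtl substitution can be sanity-checked the same way: after "andw $1023,%ax" the sign bit of %ax is clear, so sign and zero extension of %ax agree (the model functions below are mine):

```python
def movzwl(ax):
    # zero-extend %ax into %eax
    return ax & 0xFFFF

def cwtl(ax):
    # sign-extend %ax into %eax (identical to "movswl %ax,%eax")
    ax &= 0xFFFF
    return ax | 0xFFFF0000 if ax & 0x8000 else ax

for word in range(0x10000):
    ax = word & 1023          # andw $1023,%ax clears bits 10..15
    assert movzwl(ax) == cwtl(ax)
```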