[x86_64] OptPass1MOV additions to allow more 32-bit to 64-bit zero-extensions
Summary
This merge request extends the MovMov2Mov family of peephole optimisations, which remove an intermediate register, to also permit certain situations where a value is stored in a 32-bit intermediate register but its 64-bit counterpart is then written to a final destination. Under x86_64 rules, any write to a 32-bit register automatically zeroes the upper 32 bits of the full 64-bit register.
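As a minimal sketch of that rule (illustrative instructions, not taken from the patch), compare a 32-bit and a 64-bit write of the same immediate:
movl $-1,%eax        # 32-bit write: %rax becomes 0x00000000FFFFFFFF (upper half zeroed)
movq $-1,%rax        # 64-bit write: the 32-bit immediate is sign-extended, so %rax becomes 0xFFFFFFFFFFFFFFFF
This asymmetry is why the immediate case below must be restricted to non-negative values.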
When dealing with movl x,%regd; movq %regq,y, the following scenarios are valid optimisations (see the sketch after this list):
- x is a 32-bit immediate between 0 and 2^31 - 1, in which case the optimised instruction is movq x,y (negative values are not valid because movq sign-extends its 32-bit immediate).
- y is another 64-bit register, in which case the optimised instruction is movl x,y (written to y's 32-bit counterpart), and y is implicitly zero-extended. Negative immediates are valid in this case since they don't get sign-extended; hence even if x is a register or a reference, the optimisation is valid because the entire domain of x is valid.
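A hedged illustration of both cases (the operands here are invented for demonstration):
# Case 1: non-negative 32-bit immediate, 64-bit destination (here a stack slot)
movl $5,%eax         # before
movq %rax,16(%rsp)
movq $5,16(%rsp)     # after: $5 lies in 0..2^31-1, so no sign-extension can occur
# Case 2: destination is a 64-bit register; write its 32-bit counterpart instead
movl (%rdi),%eax     # before
movq %rax,%r12
movl (%rdi),%r12d    # after: the 32-bit write implicitly zeroes the upper half of %r12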
System
- Processor architecture: x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Extra peephole optimisations are now performed under x86_64 when a register is zero-extended from 32-bit to 64-bit.
Relevant logs and/or screenshots
A large number of compiled files show improvements under x86_64-win64 at -O4, and likely at lower optimisation settings too.
In cgobj, a simple example that appears in many source files - before:
.Lj443:
...
movl (%r9),%eax
movq %rax,%r12
After:
.Lj443:
...
movl (%r9),%r12d
In bzip2stream, a register zeroed with xorl is treated as the immediate 0 and zero-extended - before:
.section .text.n_bzip2stream$_$tdecompressbzip2stream_$__$$_receive_mtf_values,"ax"
...
xorl %eax,%eax
movq %rax,32(%rsp)
After:
.section .text.n_bzip2stream$_$tdecompressbzip2stream_$__$$_receive_mtf_values,"ax"
...
movq $0,32(%rsp)
Examples with other constants appear in uregexpr - before:
.Lj2207:
movl $2147483647,%eax
movq %rax,128(%rsp)
...
.Lj2208:
xorl %eax,%eax
movq %rax,112(%rsp)
...
.Lj2212:
movl $1,%eax
movq %rax,112(%rsp)
...
.Lj2248:
xorl %eax,%eax
movq %rax,112(%rsp)
movl $2147483647,%eax
movq %rax,128(%rsp)
...
.Lj2249:
movl $1,%eax
movq %rax,112(%rsp)
movl $2147483647,%eax
movq %rax,128(%rsp)
After:
.Lj2207:
movq $2147483647,128(%rsp)
...
.Lj2208:
movq $0,112(%rsp)
...
.Lj2212:
movq $1,112(%rsp)
...
.Lj2248:
movq $0,112(%rsp)
movq $2147483647,128(%rsp)
...
.Lj2249:
movq $1,112(%rsp)
movq $2147483647,128(%rsp)
In jquant2, a heap of zeroes written to the stack is simplified - before:
.section .text.n_jquant2_$$_compute_color$j_decompress_ptr$box$longint,"ax"
...
xorl %eax,%eax
movq %rax,(%rsp)
movq %rax,32(%rsp)
xorl %eax,%eax
movq %rax,24(%rsp)
movq %rax,8(%rsp)
After:
.section .text.n_jquant2_$$_compute_color$j_decompress_ptr$box$longint,"ax"
...
movq $0,(%rsp)
movq $0,32(%rsp)
movq $0,24(%rsp)
movq $0,8(%rsp)
In rax86, the following sequence appears 27 times (sometimes with a different address or branch destination) - before:
...
movq -1160(%rbp),%rdx
movzbl 63(%rdx),%ebx
testl %ebx,%ebx
jle .Lj211
xorl %eax,%eax
movq %rax,-1184(%rbp)
...
After:
...
movq -1160(%rbp),%rdx
movzbl 63(%rdx),%ebx
testl %ebx,%ebx
jle .Lj211
movq $0,-1184(%rbp)
...
In mdt, it combines with an unrelated AddMov2LeaAdd: the optimiser determines that the latter transformation is now feasible, which breaks the dependency chain and reduces the cycle count - before:
.Lj224:
movl %esi,%edi
addl $1,%esi
movl $1,%eax
movq %rax,-104(%rbp)
movq -32(%rbp),%rax
movl %esi,%edx
movsd -8(%rax,%rdx,8),%xmm9
After (it is theoretically possible to improve this assembly even further since, for example, the upper 32 bits of %rsi are zero; see the sketch below):
.Lj224:
movl %esi,%edi
leal 1(%esi),%edx
addl $1,%esi
movq $1,-104(%rbp)
movq -32(%rbp),%rax
movsd -8(%rax,%rdx,8),%xmm9
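For reference, one hypothetical form of that further improvement (hand-derived, not produced by the patch, and assuming %edx is not read later) would index with %rsi directly, since the 32-bit addl already zeroes its upper half:
.Lj224:
movl %esi,%edi
addl $1,%esi                  # %rsi = zero-extended (%esi + 1)
movq $1,-104(%rbp)
movq -32(%rbp),%rax
movsd -8(%rax,%rsi,8),%xmm9   # the leal and the %edx copy become unnecessary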