[x86] Minor improvement to BMI2-based integer division
Summary
This merge request makes a minor improvement to !283 (merged) by moving the deallocation of EDX to just before the MOVX instructions. This permits EDX to be used as the destination of the MULX instructions and alleviate register pressure even further.
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Register pressure is alleviated slightly more when building with ``-CpCOREAVX2```.
Relevant logs and/or screenshots
Once again, the Adler unit shows the most improvement (x86_64-win64, -O4) - before:
.seh_proc ADLER_$$_ADLER32$LONGWORD$PBYTE$LONGWORD$$LONGWORD
# Register rsp allocated
pushq %rbx
.seh_pushreg %rbx
.Lc3:
pushq %rsi
.seh_pushreg %rsi
.Lc4:
.seh_endprologue
...
# Register ebx allocated
movl %r10d,%ebx
# Register rdx allocated
movq $281539415969072,%rdx
# Register esi allocated
mulx %rbx,%rsi,%rsi
# Register rdx released
imull $65521,%esi
subl %esi,%ebx
# Register esi released
movl %ebx,%r10d
# Register ebx released
# Register ebx allocated
movl %ecx,%ebx
# Register rdx allocated
movq $281539415969072,%rdx
# Register esi allocated
mulx %rbx,%rsi,%rsi
# Register rdx released
imull $65521,%esi
subl %esi,%ebx
# Register esi released
movl %ebx,%ecx
# Register ebx released
...
.Lj3:
# Register rsi allocated
popq %rsi
.Lc5:
# Register rbx allocated
popq %rbx
.Lc6:
# Register rsp released
ret
Due to RDX being used as the destination, RSI is not used at all and doesn't have to be preserved with PUSH/POP - after:
...
# Register ebx allocated
movl %r10d,%ebx
# Register rdx allocated
movq $281539415969072,%rdx
mulx %rbx,%rdx,%rdx
imull $65521,%edx
subl %edx,%ebx
# Register edx released
movl %ebx,%r10d
# Register ebx released
# Register ebx allocated
movl %ecx,%ebx
# Register rdx allocated
movq $281539415969072,%rdx
mulx %rbx,%rdx,%rdx
imull $65521,%edx
subl %edx,%ebx
# Register edx released
movl %ebx,%ecx
# Register ebx released
...
.Lj3:
# Register rbx allocated
popq %rbx
.Lc4:
# Register rsp released
ret
In the Classes unit, a MOV gets removed - before:
CLASSES$_$TBINARYOBJECTWRITER_$__$$_WRITECURRENCY$CURRENCY:
...
movq $3777893186295716171,%rdx
# Register rax allocated
mulx %rsi,%rax,%rax
# Register rsi,rdx released
shrq $11,%rax
# Register rdx allocated
movq %rax,%rdx
# Register rax released
...
After:
CLASSES$_$TBINARYOBJECTWRITER_$__$$_WRITECURRENCY$CURRENCY:
...
movq $3777893186295716171,%rdx
mulx %rsi,%rdx,%rdx
# Register rsi released
shrq $11,%rdx
...
Unfortunately it's not perfect. In infodwrf, an extra MOV is added, although thanks to multiple ALUs, it shouldn't cause a slowdown - before:
LNFODWRF_$$_GETLINEINFO$QWORD$SHORTSTRING$SHORTSTRING$LONGINT$$BOOLEAN:
...
movb $0,64(%rsp)
# Register rdx allocated
movq $81422607485721415,%rdx
# Register rax allocated
mulx %rcx,%rax,%rax
# Register rdx released
# Register rflags allocated
addq %rcx,%rax
rcrq $1,%rax
# Register rflags released
shrq $10,%rax
imulq $2039,%rax
subq %rax,%rcx
# Register rax released
# Register r15 allocated
movq %rcx,%r15
# Register rdx allocated
imulq $528,%rcx,%rdx
# Register rax allocated
leaq U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
# Register rflags allocated
cmpq (%rax,%rdx),%r12 ; <-- %r12 gets set to %rcx at the start of the procedure
# Register rax,rdx released
jne .Lj329
# Register rflags released
# Register rdx allocated
imulq $528,%rcx,%rdx
# Register rax allocated
# Register rcx released
leaq U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
...
After (fixing this one would require somehow hinting the register allocator that it should use %rcx instead of %rax somehow):
LNFODWRF_$$_GETLINEINFO$QWORD$SHORTSTRING$SHORTSTRING$LONGINT$$BOOLEAN:
...
movb $0,64(%rsp)
# Register rax allocated
movq %rcx,%rax ; <-- The extra MOV
# Register rdx allocated
movq $81422607485721415,%rdx
mulx %rcx,%rdx,%rdx
# Register rflags allocated
addq %rcx,%rdx
rcrq $1,%rdx
# Register rflags released
shrq $10,%rdx
imulq $2039,%rdx
subq %rdx,%rax
# Register rdx released
# Register r15 allocated
movq %rax,%r15
# Register rdx allocated
imulq $528,%rax,%rdx
leaq U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
# Register rflags allocated
cmpq (%rax,%rdx),%rcx ; <-- Here, DeepMOVOpt changes %r12 to %rcx because %rcx hasn't changed value
# Register rax,rdx,rcx released
jne .Lj329
# Register rflags released
# Register rdx allocated
imulq $528,%r15,%rdx
# Register rax allocated
leaq U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
...
Additional notes
Because of the reuse of RDX, the improved optimisation in !285 is unable to remove the repeated constant. It might be possible to do some advanced peephole optimisations that take place before register allocation though so RDX gets preserved.
Additionally, there are situations where a register is copied, worked on and its result copied back into the original register; in this case we might as well just work with the original register (e.g. in the Adler example, both %r10d and %ecx are copied into %ebx in turn, and then after %ebx has been transmuted, the results are written back into %r10d and %ecx respectively. The peephole optimizer is not (yet) able to optimise this via DeepMOVOpt
because, for all it knows, the upper half of the 64-bit versions, %r10 and %rcx, may be non-zero (they're not in this case) and hence cause the '''MULX''' instructions to return different results. This requires additional information that the peephole optimizer can access, or better node-level optimisation, to alleviate.