[x86_64] Improved bug fix for #41397
Summary
This merge request improves the fix for the bug first raised in #41397 by preventing the unexpected mov %reg,%reg that caused MovAdd2LeaAdd and MovSub2LeaSub to malfunction, taking a "prevention is better than cure" approach.
The unexpected mov %reg,%reg appeared when MovlMovq2MovlMovl 1 was moved to Pass 2 in !1075 (merged). The optimization does not account for the fact that if the original combination is movl %reg1,%reg2; movq %reg2,%reg1, the result will be movl %reg1,%reg2; movl %reg1,%reg1. When it was in Pass 1, the movl %reg1,%reg1 instruction was removed by Mov2Nop 1 in OptPass1MOV, but Pass 2 has no such catch-all.
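For illustration only (the registers below are arbitrary stand-ins for %reg1 and %reg2, not taken from the test case), the degenerate combination looks like this:

# Original combination: reg1 = %eax, reg2 = %ecx (illustrative)
movl %eax,%ecx
movq %rcx,%rax
# After MovlMovq2MovlMovl 1 in Pass 2, per the description above:
movl %eax,%ecx
movl %eax,%eax    # self-move left behind; Mov2Nop 1 in OptPass1MOV no longer sees it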
With this merge request, if movl %reg1,%reg1 would be produced, the instruction is removed there and then, and the debug message reads MovlMovq2MovlNop 1 instead.
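With the same illustrative registers, the new behaviour amounts to:

# Original combination: reg1 = %eax, reg2 = %ecx (illustrative)
movl %eax,%ecx
movq %rcx,%rax
# Result: the movq is deleted on the spot (logged as MovlMovq2MovlNop 1)
movl %eax,%ecx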
Also, since mov %reg,%reg should never be generated, or should only appear very briefly, MovAdd2LeaAdd and MovSub2LeaSub now return internal errors if they detect one (instead of simply removing their previous patch). This will catch future situations where mov %reg,%reg is incorrectly generated and allowed to remain, or where a genuine situation arises in which the upper 32 bits of a register need to be zeroed and mov %reg,%reg should be used instead; see the sketch below.
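A minimal sketch of the kind of self-move the new guard is meant to trap (the register is illustrative, and the exact pattern matched by MovAdd2LeaAdd/MovSub2LeaSub is not reproduced here):

movl %eax,%eax    # unexpected self-move reaching the optimizer
# MovAdd2LeaAdd and MovSub2LeaSub now raise an internal error when they detect
# such an instruction, rather than quietly removing their previous patch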
System
- Processor architecture: x86-64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
The code generated by the fix for #41397 should now be more efficient overall.
Relevant logs and/or screenshots
In tw41397, the fix previously produced the following (x86_64-linux, -O4):
...
.Lj9:
movzwl %r14w,%eax
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal -1(%rax),%ecx
# Peephole Optimization: SubMov2LeaSub
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal (%rax),%eax
# Peephole Optimization: SubMov2LeaSub
subl $1,%eax
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %ecx = %eax; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: MovzMovs2MovzMovz
# Peephole Optimization: Made 32-to-64-bit zero extension more efficient (MovlMovq2MovlMovl 1)
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
movq %rcx,%rdx
sarq $63,%rdx
andq $3,%rdx
addq %rdx,%rax
sarq $2,%rax
imulq $365,%rcx,%rdx
...
After:
...
.Lj9:
movzwl %r14w,%eax
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal -1(%rax),%ecx
# Peephole Optimization: SubMov2LeaSub
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal -1(%rax),%edx
# Peephole Optimization: SubMov2LeaSub
subl $1,%eax
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %ecx = %eax; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: MovzMovs2MovzMovz
# Peephole Optimization: Made 32-to-64-bit zero extension more efficient (MovlMovq2MovlNop 1)
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: Made 32-to-64-bit zero extension more efficient (MovlMovq2MovlMovl 1)
sarq $63,%rdx
andq $3,%rdx
addq %rdx,%rax
sarq $2,%rax
imulq $365,%rcx,%rdx
...
Compared to before, more optimal code is generated: movq %rcx,%rdx is removed, and leal -1(%rax),%edx appears in place of the awkward leal (%rax),%eax (which was generated from movl %eax,%eax).
Additional Notes
Moving MovlMovq2MovlMovl 1 to the post-peephole stage was also attempted, but it produced less efficient code overall.