[x86 / Bug Fix] OptPass2Movx optimisation fix
Summary
This merge request fixes a bug first identified in #40003 (closed), where a movzwl; movzbl
pair were optimised incorrectly, either causing an incorrect value or a segmentation fault.
System
- Operating system: i386, x86_64
What is the current bug behavior?
In some situations under -O2
and above, the upper bits of a value read from memory and then shifted right by 16 would not be zero.
What is the behavior after applying this patch?
Value should now always be correct
Relevant logs and/or screenshots
Under -O2 -OoNOPEEPHOLE
, the original code attached to #40003 (closed) revealed the following on line 202 when disassembled.
# [202] W2 := LastForwardTable[Byte(ExpandedKey[I + 11] shr 16)];
movslq %eax,%rcx
movl 44(%rdx,%rcx,4),%ecx
shrl $16,%ecx
movzbl %cl,%ecx
When compiled with the trunk:
# [202] W2 := LastForwardTable[Byte(ExpandedKey[I + 11] shr 16)];
movslq %eax,%rcx
# Peephole Optimization: movl 44(%rdx,%rcx,4),%ecx; shrl $16,%ecx -> movzxwl 46(%rdx,%rcx,4),%ecx (MovShr/Sar2Movx)
movzwl 46(%rdx,%rcx,4),%ecx
# Peephole Optimization: Movzx2Nop 2
The "(MovShr/Sar2Movx)" optimisation itself is fine, but unfortunately this causes "Movzx2Nop" to be incorrectly applied. This is the fixed version, where the movzwl
instruction is changed to movzbl
:
# [202] W2 := LastForwardTable[Byte(ExpandedKey[I + 11] shr 16)];
movslq %eax,%rcx
# Peephole Optimization: movl 44(%rdx,%rcx,4),%ecx; shrl $16,%ecx -> movzxwl 46(%rdx,%rcx,4),%ecx (MovShr/Sar2Movx)
# Peephole Optimization: movzwl2movzbl 1
movzbl 46(%rdx,%rcx,4),%ecx
# Peephole Optimization: Movzx2Nop 2a
Additional notes
In Sysutils, one optimisation (same one appears twice) gets removed, and upon closer inspection I feel it is the correct course of action - before:
...
movzwl (%rax),%ecx
andl $-33,%ecx
# Peephole Optimization: Movzx2Nop 2
call fpc_char_to_uchar
movw %ax,(%r14)
.Lj292:
...
After:
...
movzwl (%rax),%ecx
andl $-33,%ecx
movzbl %cl,%ecx
call fpc_char_to_uchar
movw %ax,(%r14)
.Lj292:
...
Some instructions get moved around with this, but functionality is still the same, and out-of-order execution should mitigate any potential performance hits. For example, in fpimgcmn - before (extraneous peephole optimisation notes removed for clarity):
.section .text.n_fpimgcmn_$$_swap$word$$word,"ax"
.balign 16,0x90
.globl FPIMGCMN_$$_SWAP$WORD$$WORD
FPIMGCMN_$$_SWAP$WORD$$WORD:
.Lc2:
movw %cx,%ax
andw $255,%ax
shrw $8,%cx
# Peephole Optimization: Movzx2Nop 1
movzbl %al,%eax
shll $8,%eax
movzbl %cl,%ecx ; <-- On trunk, instruction is here.
addl %ecx,%eax
.Lc3:
ret
.Lc1:
After:
.section .text.n_fpimgcmn_$$_swap$word$$word,"ax"
.balign 16,0x90
.globl FPIMGCMN_$$_SWAP$WORD$$WORD
FPIMGCMN_$$_SWAP$WORD$$WORD:
.Lc2:
movw %cx,%ax
andw $255,%ax
shrw $8,%cx
# Peephole Optimization: movzwl2movzbl 1
movzbl %cl,%ecx ; <-- On fix, instruction is here.
movzbl %al,%eax
shll $8,%eax
# Peephole Optimization: Movzx2Nop 2a
addl %ecx,%eax
.Lc3:
ret
.Lc1:```