[x86 / Bug Fix] OptPass2Movx optimisation fix (!336) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:movzwl-movzbl-optimisation-fix into main Nov 27, 2022

Summary

This merge request fixes a bug first identified in #40003 (closed), where a movzwl; movzbl pair were optimised incorrectly, either causing an incorrect value or a segmentation fault.

System

Operating system: i386, x86_64

What is the current bug behavior?

In some situations under -O2 and above, the upper bits of a value read from memory and then shifted right by 16 would not be zero.

What is the behavior after applying this patch?

Value should now always be correct

Relevant logs and/or screenshots

Under -O2 -OoNOPEEPHOLE, the original code attached to #40003 (closed) revealed the following on line 202 when disassembled.

# [202] W2 := LastForwardTable[Byte(ExpandedKey[I + 11] shr 16)];
	movslq	%eax,%rcx
	movl	44(%rdx,%rcx,4),%ecx
	shrl	$16,%ecx
	movzbl	%cl,%ecx

When compiled with the trunk:

# [202] W2 := LastForwardTable[Byte(ExpandedKey[I + 11] shr 16)];
	movslq	%eax,%rcx
# Peephole Optimization: movl 44(%rdx,%rcx,4),%ecx; shrl $16,%ecx -> movzxwl 46(%rdx,%rcx,4),%ecx (MovShr/Sar2Movx)
	movzwl	46(%rdx,%rcx,4),%ecx
# Peephole Optimization: Movzx2Nop 2

The "(MovShr/Sar2Movx)" optimisation itself is fine, but unfortunately this causes "Movzx2Nop" to be incorrectly applied. This is the fixed version, where the movzwl instruction is changed to movzbl:

# [202] W2 := LastForwardTable[Byte(ExpandedKey[I + 11] shr 16)];
	movslq	%eax,%rcx
# Peephole Optimization: movl 44(%rdx,%rcx,4),%ecx; shrl $16,%ecx -> movzxwl 46(%rdx,%rcx,4),%ecx (MovShr/Sar2Movx)
# Peephole Optimization: movzwl2movzbl 1
	movzbl	46(%rdx,%rcx,4),%ecx
# Peephole Optimization: Movzx2Nop 2a

Additional notes

In Sysutils, one optimisation (same one appears twice) gets removed, and upon closer inspection I feel it is the correct course of action - before:

	...
	movzwl	(%rax),%ecx
	andl	$-33,%ecx
# Peephole Optimization: Movzx2Nop 2
	call	fpc_char_to_uchar
	movw	%ax,(%r14)
.Lj292:
	...

After:

	...
	movzwl	(%rax),%ecx
	andl	$-33,%ecx
	movzbl	%cl,%ecx
	call	fpc_char_to_uchar
	movw	%ax,(%r14)
.Lj292:
	...

Some instructions get moved around with this, but functionality is still the same, and out-of-order execution should mitigate any potential performance hits. For example, in fpimgcmn - before (extraneous peephole optimisation notes removed for clarity):

.section .text.n_fpimgcmn_$$_swap$word$$word,"ax"
	.balign 16,0x90
.globl	FPIMGCMN_$$_SWAP$WORD$$WORD
FPIMGCMN_$$_SWAP$WORD$$WORD:
.Lc2:
	movw	%cx,%ax
	andw	$255,%ax
	shrw	$8,%cx
# Peephole Optimization: Movzx2Nop 1
 	movzbl	%al,%eax
	shll	$8,%eax
	movzbl	%cl,%ecx ; <-- On trunk, instruction is here.
	addl	%ecx,%eax
.Lc3:
	ret
.Lc1: