[x86_64] Improved bug fix for #41397
Summary
This merge request improves the fix for the bug first raised in #41397 by preventing the unexpected mov %reg,%reg that caused MovAdd2LeaAdd and MovSub2LeaSub to malfunction, taking a "prevention is better than cure" approach.
The unexpected mov %reg,%reg appeared when MovlMovq2MovlMovl 1 was moved to Pass 2 in !1075 (merged). The optimization does not account for the fact that if the original combination is movl %reg1,%reg2; movq %reg2,%reg1, the result will be movl %reg1,%reg2; movl %reg1,%reg1. When it was in Pass 1, the movl %reg1,%reg1 instruction was removed by Mov2Nop 1 in OptPass1MOV, but Pass 2 has no such catch-all.
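For illustration only (the registers below are arbitrary stand-ins for %reg1 and %reg2, not taken from the test case), the degenerate combination looks like this:

# Original combination: reg1 = %eax, reg2 = %ecx (illustrative)
movl %eax,%ecx
movq %rcx,%rax
# After MovlMovq2MovlMovl 1 in Pass 2, per the description above:
movl %eax,%ecx
movl %eax,%eax    # self-move left behind; Mov2Nop 1 in OptPass1MOV no longer sees it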
With this merge request, if movl %reg1,%reg1 would be produced, the instruction is removed there and then, and the debug message reads MovlMovq2MovlNop 1 instead.
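With the same illustrative registers, the new behaviour amounts to:

# Original combination: reg1 = %eax, reg2 = %ecx (illustrative)
movl %eax,%ecx
movq %rcx,%rax
# Result: the movq is deleted on the spot (logged as MovlMovq2MovlNop 1)
movl %eax,%ecx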
Also, since mov %reg,%reg should never be generated, or should only appear very briefly, MovAdd2LeaAdd and MovSub2LeaSub now return internal errors if they detect one (instead of simply removing their previous patch). This will catch future situations where mov %reg,%reg is incorrectly generated and allowed to remain, or where a genuine situation arises in which the upper 32 bits of a register need to be zeroed and mov %reg,%reg should be used instead; see the sketch below.
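A minimal sketch of the kind of self-move the new guard is meant to trap (the register is illustrative, and the exact pattern matched by MovAdd2LeaAdd/MovSub2LeaSub is not reproduced here):

movl %eax,%eax    # unexpected self-move reaching the optimizer
# MovAdd2LeaAdd and MovSub2LeaSub now raise an internal error when they detect
# such an instruction, rather than quietly removing their previous patch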
System
- Processor architecture: x86-64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
The code generated by the fix for #41397 should now be more efficient overall.
Relevant logs and/or screenshots
In tw41397, the fix previously produced the following (x86_64-linux, -O4):
...
.Lj9:
movzwl %r14w,%eax
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal -1(%rax),%ecx
# Peephole Optimization: SubMov2LeaSub
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal (%rax),%eax
# Peephole Optimization: SubMov2LeaSub
subl $1,%eax
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %ecx = %eax; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: MovzMovs2MovzMovz
# Peephole Optimization: Made 32-to-64-bit zero extension more efficient (MovlMovq2MovlMovl 1)
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
movq %rcx,%rdx
sarq $63,%rdx
andq $3,%rdx
addq %rdx,%rax
sarq $2,%rax
imulq $365,%rcx,%rdx
...
After:
...
.Lj9:
movzwl %r14w,%eax
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal -1(%rax),%ecx
# Peephole Optimization: SubMov2LeaSub
# Peephole Optimization: Changed 32-bit registers in reference to 64-bit (reduces instruction size)
leal -1(%rax),%edx
# Peephole Optimization: SubMov2LeaSub
subl $1,%eax
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %ecx = %eax; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: MovzMovs2MovzMovz
# Peephole Optimization: Made 32-to-64-bit zero extension more efficient (MovlMovq2MovlNop 1)
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: Made 32-to-64-bit zero extension more efficient (MovlMovq2MovlMovl 1)
sarq $63,%rdx
andq $3,%rdx
addq %rdx,%rax
sarq $2,%rax
imulq $365,%rcx,%rdx
...
Compared to before, more optimal code is generated: movq %rcx,%rdx is removed, and leal -1(%rax),%edx appears in place of the awkward leal (%rax),%eax (which was generated from movl %eax,%eax).
Additional Notes
Moving MovlMovq2MovlMovl 1 to the post-peephole stage was also attempted, but it produced less efficient code overall.