[Cross-platform] Sub-register (field access) optimisation (!304)

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:subset-and-removal into main Oct 20, 2022

Summary

Following on from some comments in !282 (merged), this merge request updates the a_load_subsetreg_reg code generation routine so that it no longer generates an AND instruction after a SHR instruction when the mask is unnecessary (i.e. there are no bits to the left of the field that need masking out).
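
The condition can be illustrated with a small C model. This is only a sketch of the idea, not FPC's implementation: field_needs_mask and load_field are hypothetical names, and a 32-bit register is assumed. A field that starts at bit `start` and is `size` bits wide is read by shifting right by `start`; the AND is only needed when bits remain above the field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if a mask is still required after the
   right shift, i.e. if the field does not reach the top bit of the
   (assumed) 32-bit register. */
static int field_needs_mask(unsigned start, unsigned size)
{
    assert(size >= 1 && start + size <= 32);
    return start + size < 32;
}

/* Reads an unsigned field of `size` bits starting at bit `start`. */
static uint32_t load_field(uint32_t reg, unsigned start, unsigned size)
{
    uint32_t value = reg >> start;           /* SHR */
    if (field_needs_mask(start, size))
        value &= (UINT32_C(1) << size) - 1;  /* AND, only when required */
    return value;
}
```

For instance, load_field(reg, 24, 8) needs only the shift, because the shift itself fills the upper 24 bits with zeros, whereas load_field(reg, 8, 8) still needs the mask.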

System

Operating system: All
Processor architecture: All (although x86 has an extra related peephole optimization)

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Unnecessary AND instructions are no longer generated in code that reads structure fields.

Additional notes

A related optimisation in the x86 peephole optimizer was adjusted since the above change caused a single inefficiency in the cgobj assembly. The fix ended up improving generated code elsewhere. For example, under x86_64-win64, -O3 in the blowfish unit - before:

.section .text.n_blowfish$_$tblowfish_$__$$_f$longword$$longword,"ax"
	.balign 16,0x90
.globl	BLOWFISH$_$TBLOWFISH_$__$$_F$LONGWORD$$LONGWORD
BLOWFISH$_$TBLOWFISH_$__$$_F$LONGWORD$$LONGWORD:
.Lc6:
	movl	%edx,%r8d
	andl	$255,%r8d
	shrl	$8,%edx
	movl	%edx,%r9d
	andl	$255,%r9d
	shrl	$8,%edx
	movl	%edx,%eax
	andl	$255,%eax
	shrl	$8,%edx
# Peephole Optimization: AndMovzToAnd done
	andl	$255,%edx
	movzbl	%al,%eax
	...

After (the peephole optimizer is able to track through all of the SHR instructions and note the final AND instruction is unnecessary):

.section .text.n_blowfish$_$tblowfish_$__$$_f$longword$$longword,"ax"
	.balign 16,0x90
.globl	BLOWFISH$_$TBLOWFISH_$__$$_F$LONGWORD$$LONGWORD
BLOWFISH$_$TBLOWFISH_$__$$_F$LONGWORD$$LONGWORD:
.Lc6:
	movl	%edx,%r8d
	andl	$255,%r8d
	shrl	$8,%edx
	movl	%edx,%r9d
	andl	$255,%r9d
	shrl	$8,%edx
	movl	%edx,%eax
	andl	$255,%eax
	shrl	$8,%edx
# Peephole Optimization: AndMovzToAnd done
# Peephole Optimization: Removed AND instruction since previous SHR makes this an identity operation (ShrAnd2Shr)
	movzbl	%al,%eax
	...
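
The reasoning behind the removal can be sketched in C. This is an illustrative model, not the peephole optimizer's actual code: and_is_identity is a hypothetical name, and it assumes a 32-bit register that has been shifted right by a known total amount. After the three `shrl $8` instructions, only the low 8 bits of %edx can still be set, so `andl $255` cannot change the value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check: after a 32-bit register has been shifted right by
   `total_shift` bits in total, is ANDing it with `mask` a no-op? */
static int and_is_identity(unsigned total_shift, uint32_t mask)
{
    assert(total_shift < 32);
    /* Bits that may still be non-zero after the shifts. */
    uint32_t possibly_set = UINT32_MAX >> total_shift;
    /* The AND changes nothing if the mask keeps every such bit. */
    return (possibly_set & ~mask) == 0;
}
```

With a total shift of 24, and_is_identity(24, 255) is true, which matches the removed instruction; with a total shift of only 16 it is false, which is why the earlier AND instructions in the listing have to stay.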

In sfpu128, a zero-extension is stripped - before:

.globl	SFPU128_$$_MUL32TO64$LONGWORD$LONGWORD$LONGWORD$LONGWORD
SFPU128_$$_MUL32TO64$LONGWORD$LONGWORD$LONGWORD$LONGWORD:
.Lc108:
.seh_proc SFPU128_$$_MUL32TO64$LONGWORD$LONGWORD$LONGWORD$LONGWORD
	pushq	%rbx
.seh_pushreg %rbx
.Lc109:
.seh_endprologue
	movw	%cx,%ax
	shrl	$16,%ecx
	movw	%dx,%r10w
	shrl	$16,%edx
	movzwl	%ax,%r11d
	movzwl	%r10w,%ebx
	imull	%ebx,%r11d
	movzwl	%ax,%eax
	movzwl	%dx,%ebx
	imull	%ebx,%eax
	movl	%ecx,%ebx
	movzwl	%r10w,%r10d
	imull	%r10d,%ebx
	movzwl	%cx,%ecx
	movzwl	%dx,%edx
	imull	%edx,%ecx
	...

After:

.globl	SFPU128_$$_MUL32TO64$LONGWORD$LONGWORD$LONGWORD$LONGWORD
SFPU128_$$_MUL32TO64$LONGWORD$LONGWORD$LONGWORD$LONGWORD:
.Lc108:
.seh_proc SFPU128_$$_MUL32TO64$LONGWORD$LONGWORD$LONGWORD$LONGWORD
	pushq	%rbx
.seh_pushreg %rbx
.Lc109:
.seh_endprologue
	movw	%cx,%ax
	shrl	$16,%ecx
	movw	%dx,%r10w
	shrl	$16,%edx
	movzwl	%ax,%r11d
	movzwl	%r10w,%ebx
	imull	%ebx,%r11d
	movzwl	%ax,%eax
	movzwl	%dx,%ebx
	imull	%ebx,%eax
	movl	%ecx,%ebx
	movzwl	%r10w,%r10d
	imull	%r10d,%ebx
	# (Instruction "movzwl %cx,%ecx" gets stripped)
	movzwl	%dx,%edx
	imull	%edx,%ecx
	...
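
The stripped zero-extension can be understood with a small C model. This is an analogy rather than the compiler's logic, and high_half and high_half_zero_extended are hypothetical names. Presumably the instruction can go because `shrl $16,%ecx` has already cleared the upper 16 bits of %ecx, so zero-extending its low 16 bits, which is what `movzwl %cx,%ecx` does, returns the same value.

```c
#include <stdint.h>

/* Models `shrl $16,%ecx`. */
static uint32_t high_half(uint32_t x)
{
    return x >> 16;
}

/* Models `shrl $16,%ecx` followed by `movzwl %cx,%ecx`: truncate to the
   low 16 bits, then zero-extend back to 32 bits. */
static uint32_t high_half_zero_extended(uint32_t x)
{
    return (uint32_t)(uint16_t)(x >> 16);
}
```

Both functions return the same result for every 32-bit input, so the second step adds nothing.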
