[x86] New SHR peephole optimizations
Summary
This merge request contains two new but related peephole optimizations for the x86 family of processors:
- If a SHR/AND combination is found, both with constants, and the AND mask covers all of the possible 1-bits (e.g. shrl $24,%eax; andl $255,%eax), then the AND instruction is removed.
- If a SHR/MOVZX combination is found, and the shift makes the zero-extension unnecessary (e.g. shrl $24,%eax; movzbl %al,%eax), then the MOVZX instruction is removed, or changed into a regular MOV, depending on whether or not the destination register is in the same superclass as the source register. (A short sketch of why these two rules are safe follows this list.)
- Additional code tries to permit these optimisations to be made across jumps or across multiple SHR and even SHL instructions.
- In addition, if a shr %cl,%reg instruction is followed by a shr const,%reg that writes to the same register (as long as it is not in the %ecx class), the two instructions are swapped to help minimise a pipeline stall should %cl be written to immediately beforehand.
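As a rough illustration of why the first two rules are safe, here is a minimal C sketch of the bit-level reasoning (not code from the compiler itself): once a 32-bit value has been shifted right by 24, only the lowest 8 bits can be non-zero, so masking with 255 or zero-extending the low byte cannot change the result.

```
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0xDEADBEEFu;

    /* SHR/AND: shrl $24,%eax leaves only bits 0..7 possibly set,
       so andl $255,%eax is a no-op and can be removed.            */
    assert(((x >> 24) & 0xFFu) == (x >> 24));

    /* SHR/MOVZX: the shifted value already fits in a byte, so
       zero-extending the low byte (movzbl %al,%eax) returns the
       same value and the MOVZX is redundant.                      */
    assert((uint32_t)(uint8_t)(x >> 24) == (x >> 24));

    return 0;
}
```

The same argument applies to any shift count and any AND mask that covers all of the bits that can still be 1 after the shift.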
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Compiled programs should see improvements in both speed and size, including the compiler itself, since commonly-called methods such as TUsedRegs.Update receive these optimisations.
Additional notes
Most of the optimisations are done in the post-peephole stage, once all other optimisations have been performed, especially as Pass 2 contains some extensive ones centred around MOVZX instructions. However, a couple of optimisations are also performed in Pass 1 (converting a MOVZX to a MOV instruction, and merging two SHR instructions together) since these can sometimes open up other optimisations.
Relevant logs and/or screenshots
Quite a large number of files receive improvements. Besides the standard removal of AND instructions to save a cycle and a few bytes of code size (removing the AND instruction from shrl $24,%edx; andl $255,%edx is a common occurrence), some combinations of these optimisations also occur. On a simple level, for aasmcpu under x86_64-win64, -O4 - before:
...
movl %ecx,%edi
shrl $24,%edi
andl $255,%edi
movzbl %dil,%eax
...
.Lj1001:
After:
...
movl %ecx,%edi
shrl $24,%edi
movl %edi,%eax
...
.Lj1001:
On older CPUs where MOVZX takes multiple cycles to execute, this change reduces the cycle count from 5 to 3 (assuming 1 cycle each for MOV, SHR and AND, and 2 for MOVZX), and there is potential to reduce it further with the development of BMI-based optimisations that specialise in bit extraction.
cgobj demonstrates a good cross-jump example - before:
.seh_proc CGOBJ$_$TCG_$__$$_ADD_REG_INSTRUCTION$TAI$TREGISTER
...
movq %rcx,%rax
movl %r8d,%ecx
shrl $24,%ecx
andl $255,%ecx
movzbl %cl,%r9d
cmpq $0,8(%rax,%r9,8)
je .Lj83
movzbl %cl,%ecx
movq 8(%rax,%rcx,8),%rcx
After:
.seh_proc CGOBJ$_$TCG_$__$$_ADD_REG_INSTRUCTION$TAI$TREGISTER
...
movq %rcx,%rax
movl %r8d,%ecx
shrl $24,%ecx
movl %ecx,%r9d
cmpq $0,8(%rax,%r9,8)
je .Lj83
movq 8(%rax,%rcx,8),%rcx
This will benefit further from optimisation hints that tell the compiler which 64-bit registers have a zero upper half, since %r9 can be replaced with %rcx in the CMP instruction's reference (see !74).
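For context on why the extracted byte shows up on both sides of the jump, the source-level shape is roughly a table lookup guarded by a nil check, where the same index is needed for both the test and the subsequent load. A hypothetical C sketch of that shape (not the actual cgobj code; the names are invented):

```
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the top byte of "reg" selects a table entry, and
   the same index is used before and after the conditional jump, which is
   why the AND/MOVZX pair can be removed across the branch.               */
void *lookup(void **table, uint32_t reg)
{
    uint32_t index = reg >> 24;   /* shrl $24 - no AND or MOVZX required  */

    if (table[index] == NULL)     /* cmpq $0,8(%rax,%r9,8); je .Lj83      */
        return NULL;

    return table[index];          /* movq 8(%rax,%rcx,8),%rcx             */
}
```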
In the chmfiftimain unit, the SHR merging is coupled with the shr %cl,%reg; shr x,%reg optimisation - before:
.Lj237:
...
movzbl -20(%rbx),%eax
movl %esi,%edx
movl %eax,%ecx
shrl %cl,%edx
shrl $16,%edx
shrw $8,%dx
After:
.Lj237:
...
movzbl -20(%rbx),%eax
movl %esi,%edx
movl %eax,%ecx
shrl $24,%edx
shrl %cl,%edx
It's hard to say exactly how the processor will schedule the instructions, but if movzbl -20(%rbx),%eax and movl %esi,%edx execute in parallel, then the final swap will permit shrl $24,%edx to execute at the same time as movl %eax,%ecx, and then shrl %cl,%edx will execute once those instructions have finished, totalling 3 cycles (not counting the latency of reading from -20(%rbx)). On the other hand, if shrl %cl,%edx is forced to execute first, there is a pipeline stall while it waits for the result of movl %eax,%ecx, and so it takes 4 cycles instead.
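As a sanity check on the combined transformation, here is a minimal C sketch (again, not compiler code) showing that merging shrl $16 and shrw $8 into shrl $24 and then swapping it with the variable-count shift preserves the result: the upper half of the register is already zero by the time the word-sized shift runs, and successive right shifts simply accumulate, so the order does not matter.

```
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0xDEADBEEFu;

    /* c ranges over 0..31, matching the way x86 masks the count in %cl. */
    for (uint32_t c = 0; c < 32; c++) {
        uint32_t before = ((x >> c) >> 16) >> 8;  /* shrl %cl; shrl $16; shrw $8 */
        uint32_t after  = (x >> 24) >> c;         /* shrl $24; shrl %cl          */

        assert(before == after);
    }

    return 0;
}
```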