
[x86] New SHR peephole optimizations

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:shr-and into main

Summary

This merge request contains two new but related peephole optimizations for the x86 family of processors:

  • If a SHR/AND combination is found, both with constant operands, and the AND mask covers every bit that can still be set after the shift (e.g. shrl $24,%eax; andl $255,%eax), then the AND instruction is removed.
  • If a SHR/MOVZX combination is found and the shift makes the zero-extension unnecessary (e.g. shrl $24,%eax; movzbl %al,%eax), then the MOVZX instruction is either removed or changed into a regular MOV, depending on whether or not the destination register is in the same superclass as the source register.
  • Additional code attempts to permit these optimisations across jumps and across multiple SHR (and even SHL) instructions.
  • Additionally, if a shr %cl,%reg instruction is followed by shr $const,%reg (both writing to the same register, as long as it is not in the %ecx class), the two instructions are swapped to help minimise a pipeline stall should %cl be written immediately beforehand (see the sketch after this list).
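
A minimal sketch of that swap (hypothetical registers; the two orderings compute the same value because the shift counts simply accumulate, assuming the flags after the pair are not read) - before:

	movl	%eax,%ecx
	shrl	%cl,%edx
	shrl	$16,%edx

After:

	movl	%eax,%ecx
	shrl	$16,%edx
	shrl	%cl,%edx

Here shrl $16,%edx no longer has to wait for the write to %cl before it can start executing.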

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Compiled programs should see improvements in both speed and size, and so should the compiler itself, since commonly-called methods such as TUsedRegs.Update benefit from these optimisations.

Additional notes

Most of the optimisations are performed in the post-peephole stage, once all other optimisations have been made, especially as Pass 2 contains some extensive ones centred around MOVZX instructions. However, a couple of optimisations are also performed in Pass 1 (converting a MOVZX to a MOV instruction, and merging two SHR instructions into one), since these can sometimes open up further optimisations; a sketch of the merge follows.
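
As an illustrative sketch of the SHR merge (hypothetical values), two consecutive constant shifts of the same register:

	shrl	$16,%edx
	shrl	$8,%edx

are merged into a single shift whose count is the sum:

	shrl	$24,%edx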

Relevant logs and/or screenshots

Quite a large number of files receive improvements. Besides the standard removal of AND instructions to save a cycle and a few bytes of code size (removing the AND instruction from shrl $24,%edx; andl $255,%edx is a common occurrence), more elaborate combinations also appear. On a simple level, for aasmcpu under x86_64-win64, -O4 - before:

	...
	movl	%ecx,%edi
	shrl	$24,%edi
	andl	$255,%edi
	movzbl	%dil,%eax
	...
.Lj1001:

After:

	...
	movl	%ecx,%edi
	shrl	$24,%edi
	movl	%edi,%eax
	...
.Lj1001:

On older CPUs where MOVZX takes multiple cycles to execute, this change reduces the cycle count from 5 to 3, and there is potential to reduce it further with the development of BMI-based optimisations that specialise in bit extraction.
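
Purely as an illustrative sketch of where BMI could take this (not part of this merge request; the register choices, including %edx for the control word, are hypothetical), BMI1's BEXTR extracts an arbitrary bitfield in one instruction, which becomes profitable when the control word can be loaded once outside a loop. Extracting bits 8-15, where the AND cannot simply be dropped - before:

	movl	%ecx,%eax
	shrl	$8,%eax
	andl	$255,%eax

After:

	movl	$0x0808,%edx	# length 8 in bits 15:8, start bit 8 in bits 7:0
	bextrl	%edx,%ecx,%eax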

cgobj demonstrates a good cross-jump example - before:

.seh_proc CGOBJ$_$TCG_$__$$_ADD_REG_INSTRUCTION$TAI$TREGISTER
	...
	movq	%rcx,%rax
	movl	%r8d,%ecx
	shrl	$24,%ecx
	andl	$255,%ecx
	movzbl	%cl,%r9d
	cmpq	$0,8(%rax,%r9,8)
	je	.Lj83
	movzbl	%cl,%ecx
	movq	8(%rax,%rcx,8),%rcx

After:

.seh_proc CGOBJ$_$TCG_$__$$_ADD_REG_INSTRUCTION$TAI$TREGISTER
	...
	movq	%rcx,%rax
	movl	%r8d,%ecx
	shrl	$24,%ecx
	movl	%ecx,%r9d
	cmpq	$0,8(%rax,%r9,8)
	je	.Lj83
	movq	8(%rax,%rcx,8),%rcx

This will benefit further from optimisation hints that tell the compiler which 64-bit registers have a zeroed upper half, since %r9 could then be replaced with %rcx in the CMP instruction's reference (see !74).
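
Purely as a speculative sketch of what that might produce (contingent on !74; not part of this merge request): the 32-bit shrl $24,%ecx already zeroes the upper half of %rcx, so the copy into %r9d could disappear entirely:

	...
	movq	%rcx,%rax
	movl	%r8d,%ecx
	shrl	$24,%ecx
	cmpq	$0,8(%rax,%rcx,8)
	je	.Lj83
	movq	8(%rax,%rcx,8),%rcx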

In the chmfiftimain unit, the SHR merging is coupled with the shr %cl,%reg; shr $const,%reg swap - before:

.Lj237:
	...
	movzbl	-20(%rbx),%eax
	movl	%esi,%edx
	movl	%eax,%ecx
	shrl	%cl,%edx
	shrl	$16,%edx
	shrw	$8,%dx

After:

.Lj237:
	...
	movzbl	-20(%rbx),%eax
	movl	%esi,%edx
	movl	%eax,%ecx
	shrl	$24,%edx
	shrl	%cl,%edx

It's hard to say exactly how the processor will schedule the instructions, but if movzbl -20(%rbx),%eax and movl %esi,%edx execute in parallel, the swap permits shrl $24,%edx to execute at the same time as movl %eax,%ecx, with shrl %cl,%edx executing once those instructions have finished - a total of 3 cycles (not counting the latency of the read from -20(%rbx)). On the other hand, if shrl %cl,%edx were left to execute first, there would be a pipeline stall waiting on movl %eax,%ecx, taking 4 cycles instead.
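
Looking further ahead, and again outside the scope of this merge request, BMI2's SHRX reads its shift count from any general-purpose register and does not write the flags, so on BMI2-capable CPUs the copy into %cl, and with it the stall, could be avoided altogether. A hypothetical sketch, assuming %ecx is not otherwise needed and the shift's flags are not read afterwards:

	movzbl	-20(%rbx),%eax
	movl	%esi,%edx
	shrl	$24,%edx
	shrxl	%eax,%edx,%edx	# %edx := %edx shr (%eax and 31), no %cl involved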
