[x86] New SHR peephole optimizations
Summary
This merge request contains two new but related peephole optimizations for the x86 family of processors:
- If a SHR/AND combination is found, both with constants, and the AND mask covers all of the possible 1-bits (e.g. shrl $24,%eax; andl $255,%eax), then the AND instruction is removed.
- If a SHR/MOVZX combination is found, and the shift makes the zero-extension unnecessary (e.g. shrl $24,%eax; movzbl %al,%eax), then the MOVZX instruction is removed, or changed into a regular MOV, depending on whether or not the destination register is in the same superclass as the source register. (A short sketch of why these two rules are safe follows this list.)
- Additional code tries to permit these optimisations to be made across jumps or across multiple SHR and even SHL instructions.
- In addition, if a shr %cl,%reg instruction is followed by a shr const,%reg that writes to the same register (as long as it is not in the %ecx class), the two instructions are swapped to help minimise a pipeline stall should %cl be written to immediately beforehand.
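As a rough illustration of why the first two rules are safe, here is a minimal C sketch of the bit-level reasoning (not code from the compiler itself): once a 32-bit value has been shifted right by 24, only the lowest 8 bits can be non-zero, so masking with 255 or zero-extending the low byte cannot change the result.

```
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0xDEADBEEFu;

    /* SHR/AND: shrl $24,%eax leaves only bits 0..7 possibly set,
       so andl $255,%eax is a no-op and can be removed.            */
    assert(((x >> 24) & 0xFFu) == (x >> 24));

    /* SHR/MOVZX: the shifted value already fits in a byte, so
       zero-extending the low byte (movzbl %al,%eax) returns the
       same value and the MOVZX is redundant.                      */
    assert((uint32_t)(uint8_t)(x >> 24) == (x >> 24));

    return 0;
}
```

The same argument applies to any shift count and any AND mask that covers all of the bits that can still be 1 after the shift.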
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Compiled programs should see improvements in both speed and size, including the compiler itself, since commonly-called methods such as TUsedRegs.Update receive these optimisations.
Additional notes
Most of the optimisations are done in the post-peephole stage, once all other optimisations have been performed, especially as Pass 2 contains some extensive ones centred around MOVZX instructions. However, a couple of optimisations are also performed in Pass 1 (converting a MOVZX to a MOV instruction, and merging two SHR instructions together) since these can sometimes open up other optimisations.
Relevant logs and/or screenshots
Quite a large number of files receive improvements. Besides the standard removal of AND instructions to save a cycle and a few bytes of code size (removing the AND instruction from shrl $24,%edx; andl $255,%edx is a common occurrence), some combinations of these optimisations also occur. On a simple level, for aasmcpu under x86_64-win64, -O4 - before:
...
movl %ecx,%edi
shrl $24,%edi
andl $255,%edi
movzbl %dil,%eax
...
.Lj1001:
After:
...
movl %ecx,%edi
shrl $24,%edi
movl %edi,%eax
...
.Lj1001:
On older CPUs where MOVZX takes multiple cycles to execute, this change reduces the cycle count from 5 to 3 (assuming 1 cycle each for MOV, SHR and AND, and 2 for MOVZX), and there is potential to reduce it further with the development of BMI-based optimisations that specialise in bit extraction.
cgobj demonstrates a good cross-jump example - before:
.seh_proc CGOBJ$_$TCG_$__$$_ADD_REG_INSTRUCTION$TAI$TREGISTER
...
movq %rcx,%rax
movl %r8d,%ecx
shrl $24,%ecx
andl $255,%ecx
movzbl %cl,%r9d
cmpq $0,8(%rax,%r9,8)
je .Lj83
movzbl %cl,%ecx
movq 8(%rax,%rcx,8),%rcx
After:
.seh_proc CGOBJ$_$TCG_$__$$_ADD_REG_INSTRUCTION$TAI$TREGISTER
...
movq %rcx,%rax
movl %r8d,%ecx
shrl $24,%ecx
movl %ecx,%r9d
cmpq $0,8(%rax,%r9,8)
je .Lj83
movq 8(%rax,%rcx,8),%rcx
This will benefit further from optimisation hints that tell the compiler which 64-bit registers have a zero upper half, since %r9 can be replaced with %rcx in the CMP instruction's reference (see !74).
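For context on why the extracted byte shows up on both sides of the jump, the source-level shape is roughly a table lookup guarded by a nil check, where the same index is needed for both the test and the subsequent load. A hypothetical C sketch of that shape (not the actual cgobj code; the names are invented):

```
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the top byte of "reg" selects a table entry, and
   the same index is used before and after the conditional jump, which is
   why the AND/MOVZX pair can be removed across the branch.               */
void *lookup(void **table, uint32_t reg)
{
    uint32_t index = reg >> 24;   /* shrl $24 - no AND or MOVZX required  */

    if (table[index] == NULL)     /* cmpq $0,8(%rax,%r9,8); je .Lj83      */
        return NULL;

    return table[index];          /* movq 8(%rax,%rcx,8),%rcx             */
}
```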
In the chmfiftimain unit, the SHR merging is coupled with the shr %cl,%reg; shr x,%reg optimisation - before:
.Lj237:
...
movzbl -20(%rbx),%eax
movl %esi,%edx
movl %eax,%ecx
shrl %cl,%edx
shrl $16,%edx
shrw $8,%dx
After:
.Lj237:
...
movzbl -20(%rbx),%eax
movl %esi,%edx
movl %eax,%ecx
shrl $24,%edx
shrl %cl,%edx
It's hard to say exactly how the processor will schedule the instructions, but if movzbl -20(%rbx),%eax and movl %esi,%edx execute in parallel, then the final swap will permit shrl $24,%edx to execute at the same time as movl %eax,%ecx, and then shrl %cl,%edx will execute once those instructions have finished, totalling 3 cycles (not counting the latency of reading from -20(%rbx)). On the other hand, if shrl %cl,%edx is forced to execute first, there is a pipeline stall while it waits for the result of movl %eax,%ecx, and so it takes 4 cycles instead.
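As a sanity check on the combined transformation, here is a minimal C sketch (again, not compiler code) showing that merging shrl $16 and shrw $8 into shrl $24 and then swapping it with the variable-count shift preserves the result: the upper half of the register is already zero by the time the word-sized shift runs, and successive right shifts simply accumulate, so the order does not matter.

```
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0xDEADBEEFu;

    /* c ranges over 0..31, matching the way x86 masks the count in %cl. */
    for (uint32_t c = 0; c < 32; c++) {
        uint32_t before = ((x >> c) >> 16) >> 8;  /* shrl %cl; shrl $16; shrw $8 */
        uint32_t after  = (x >> 24) >> c;         /* shrl $24; shrl %cl          */

        assert(before == after);
    }

    return 0;
}
```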