[x86] "x and ((1 shl y) - 1)" now uses BZHI
Summary
This merge request takes advantage of BMI2 (if it's enabled) to convert bitmasking sequences of the form x and ((1 shl y) - 1) (where y is a variable) into BZHI instructions at the node level.
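The equivalence this conversion relies on can be sketched in C (used here only for illustration; it is not code from the patch). The helper names bzhi32_model, mask_low_bits and check_in_range_equivalence are hypothetical. The model follows the documented 32-bit BZHI behaviour and is only compared against the masking idiom for bit counts 0 to 31.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of 32-bit BZHI (hypothetical helper, not compiler code).
   Only the low 8 bits of the index are used; an index of 32 or more
   leaves the source value unchanged. */
static uint32_t bzhi32_model(uint32_t src, uint32_t index)
{
    uint32_t n = index & 0xFFu;
    if (n >= 32u)
        return src;
    /* Clear bit n and everything above it. */
    return src & ~(UINT32_MAX << n);
}

/* The masking idiom from the summary, valid in C only for y in 0..31. */
static uint32_t mask_low_bits(uint32_t x, uint32_t y)
{
    return x & ((UINT32_C(1) << y) - 1u);
}

/* Asserts that both forms agree for every in-range bit count. */
static void check_in_range_equivalence(uint32_t x)
{
    for (uint32_t y = 0; y < 32u; ++y)
        assert(bzhi32_model(x, y) == mask_low_bits(x, y));
}
```

Calling check_in_range_equivalence with any 32-bit value exercises all 32 in-range counts. Counts of 32 or more are deliberately left out of this comparison.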
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
When compiling with -CpCOREAVX2, improvements should be made in some situations where expressions of the form x and ((1 shl y) - 1) appear, reducing the cycle count.
Additional notes
The merge request is split up into four distinct commits:
- The first introduces a new emit_reg_ref_reg routine, along with supporting methods, to permit the node-level generation of a BZHI instruction that has operands in this order.
- The second converts an "and" node and a suitable subtree into a BZHI instruction.
- Sometimes, the node's first pass will convert an "and" node into an "inline" node that calls in_and_assign_x_y. The third commit looks out for this sequence as well in an attempt to create a BZHI instruction.
- The fourth contains 13 new tests that aim to test each part of the code generator that creates BZHI instructions in all of its forms, including ones where the shift operand is undersized and oversized.
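As background for why boundary cases deserve tests, the sketch below models, in C, how the original shift-based sequence and BZHI treat a 32-bit count that is out of range. This is an assumption-laden illustration of documented x86 behaviour (SHL/SHLX use only the low 5 bits of a 32-bit count; BZHI uses the low 8 bits of its index), not a description of what the new tests cover or of how the patch handles such values. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the original sequence as x86 executes it for 32-bit operands:
   the shift uses only the low 5 bits of the count. */
static uint32_t mask_with_x86_shift(uint32_t x, uint32_t y)
{
    uint32_t count = y & 31u;
    return x & ((UINT32_C(1) << count) - 1u);
}

/* Model of 32-bit BZHI: low 8 bits of the index are used, and an
   index of 32 or more leaves the source unchanged. */
static uint32_t bzhi32_reference(uint32_t src, uint32_t index)
{
    uint32_t n = index & 0xFFu;
    return (n >= 32u) ? src : (src & ~(UINT32_MAX << n));
}

/* Illustrative checks: agreement at 31, divergence at a count of 32. */
static void check_boundary_behaviour(void)
{
    assert(mask_with_x86_shift(0xFFFFFFFFu, 31u) ==
           bzhi32_reference(0xFFFFFFFFu, 31u));
    assert(mask_with_x86_shift(0xFFFFFFFFu, 32u) == 0u);
    assert(bzhi32_reference(0xFFFFFFFFu, 32u) == 0xFFFFFFFFu);
}
```

In this model the two forms agree for counts 0 to 31 and differ from 32 upwards, so any value that can reach that range is worth checking separately.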
Because the code additions share space with the previous ANDN optimisation (!305, merged), the if-blocks were restructured so that both optimisations can share some of them. At the same time, a couple of internal error number clashes were fixed.
Relevant logs and/or screenshots
Parts of hlcgobj get some different registers allocated, but the main improvement is as follows - before:
...
.Lj644:
...
call *280(%rax)
leaq (,%rax,8),%rdx
movl $1,%eax
shlx %rdx,%rax,%rax
subq $1,%rax
andq %rsi,%rax
...
After:
...
.Lj644:
...
call *280(%rax)
shlq $3,%rax
bzhi %rax,%rsi,%rax
...
As with hlcgobj, the fpreadtiff unit gets different registers allocated, centred around this improvement - before:
...
.Lj982:
movzbl -60(%rdx),%r8d
movl $1,%r9d
shlx %r8d,%r9d,%r8d
subl $1,%r8d
andl -52(%rdx),%r8d
movw %r8w,%ax
...
After:
...
.Lj982:
movzbl -60(%rdx),%r8d
bzhi %r8d,-52(%rdx),%ecx
movw %cx,%ax
...
For the inline-node optimisation, outside of the tests that seek to trigger it, the jcphuff unit of the PasJPEG package gains an improvement - before:
...
.Lj43:
cmpb $0,24(%rbx)
jne .Lj40
movslq %edi,%rax
movl $1,%edx
shlx %eax,%edx,%eax
subl $1,%eax
andl %eax,%esi
addl %edi,%r12d
...
After:
...
.Lj43:
cmpb $0,24(%rbx)
jne .Lj40
bzhi %edi,%esi,%esi
addl %edi,%r12d
...