[x86] Code generator now generates SARX under BMI2
Summary
This merge request makes the code generator emit SARX instructions instead of SAR instructions under BMI2 when the SarInt intrinsics are used.
As an additional commit, BEXTR and BZHI are now checked alongside SARX, SHLX, SHRX and RORX when it comes to register spilling, since these instructions follow the same operand format.
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
The compiler tends to mix SAR with SHLX and SHRX.
What is the behavior after applying this patch?
The code generator now favours SARX when compiling under BMI2.
Relevant logs and/or screenshots
Improvements in the code are limited since the SarInt routines are not often used, but there are a few.
In the jwaWindows unit under x86_64-win64, -O4 - before:
.section .text.n_jwawindows_$$_int64shramod32$int64$longword$$int64,"ax"
.balign 16,0x90
.globl JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
movq %rcx,%rax
movb %dl,%cl
sarq %cl,%rax
.Lc12:
ret
After:
.section .text.n_jwawindows_$$_int64shramod32$int64$longword$$int64,"ax"
.balign 16,0x90
.globl JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
movzbl %dl,%edx
sarx %rdx,%rcx,%rax
.Lc12:
ret
It's not perfect though. In the calc_divconst_magic_signed routine in the cgutils unit - before:
...
movl $64,%edx
subq %rax,%rdx
movb %dl,%cl
movq (%rdi),%rax
shlx %rdx,%rax,%rax
sarq %cl,%rax
...
After:
...
movl $64,%edx
subq %rax,%rdx
movzbl %dl,%eax
movq (%rdi),%rcx
shlx %rdx,%rcx,%rdx
sarx %rax,%rdx,%rax
...
These instructions correspond to:
sign_corrective_shift := (SizeOf(aint) * 8) - N;
magic_m := SarInt64(magic_m shl sign_corrective_shift, sign_corrective_shift);
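As a cross-check of what this pair of statements computes, here is a minimal Python model. It is not compiler code: the function name is hypothetical and it assumes aint is 64 bits wide, as on x86_64. Shifting left by 64 - N and then arithmetically right by the same amount sign-extends the low N bits of magic_m.

```python
BITS = 64                 # assumption: SizeOf(aint) * 8 = 64 on x86_64
MASK = (1 << BITS) - 1


def sign_extend_low_bits(value, n):
    """Model of SarInt64(value shl (64 - n), 64 - n) for 1 <= n <= 64.

    Returns the result as a signed Python integer.
    """
    shift = BITS - n
    shifted = (value << shift) & MASK   # shl, truncated to 64 bits
    if shifted >> (BITS - 1):           # reinterpret the bit pattern as signed
        shifted -= 1 << BITS
    return shifted >> shift             # >> on Python ints is an arithmetic shift
```

For example, sign_extend_low_bits(0x80, 8) yields -128, while sign_extend_low_bits(0x7F, 8) yields 127.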
There is an inefficiency here: the shift value in the SarInt intrinsics is a Byte, so the code generator is forced to zero-extend the selector register before using it. The most efficient code would be:
...
movl $64,%edx
subq %rax,%rdx
shlx %rdx,(%rdi),%rcx
sarx %rdx,%rcx,%rax
...
This can potentially be caught in the x86 code generator because SARX (along with SHLX and SHRX) implicitly performs a bitwise-AND operation on the selector ($1F for the 32-bit version and $3F for the 64-bit version), so the upper bits of the selector register are ignored and the zero-extension from Byte is redundant.
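The reasoning can be sanity-checked with a small Python model of the instruction semantics. This is only a sketch: the function names are hypothetical, and it models the documented count masking rather than anything in the compiler itself. It shows that zero-extending the low byte of the selector first gives the same result as passing the full register.

```python
def sarx(value, count, bits=64):
    """Model of SARX: arithmetic right shift with the count masked
    to $1F (32-bit operands) or $3F (64-bit operands)."""
    full = (1 << bits) - 1
    count &= bits - 1                 # implicit AND performed by the instruction
    value &= full
    if value >> (bits - 1):           # reinterpret the bit pattern as signed
        value -= 1 << bits
    return (value >> count) & full    # return the raw register bit pattern


def movzx_byte(reg):
    """Model of MOVZBL: keep only the low 8 bits of the register."""
    return reg & 0xFF


def zero_extension_is_redundant(value, selector, bits=64):
    """True if SARX gives the same result with and without the MOVZBL."""
    return sarx(value, movzx_byte(selector), bits) == sarx(value, selector, bits)
```

Because the mask ($1F or $3F) fits entirely within the low byte, zero_extension_is_redundant holds for any selector value in this model, which is what would let the movzbl be dropped.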