[x86] Unsigned integer division upgrade to use BMI2 if possible
Summary
This merge request upgrades the multiplicative reciprocal algorithm for unsigned integer division to use MULX (a BMI2 instruction) if permitted to do so (use the -CpCOREAVX2
compiler option). The speed gain is negligable, but it reduces register pressure since EAX doesn't have to be reserved and EDX is only read from, hence other peephole optimisations may be possible.
Additionally, for 64-bit unsigned mod
operations, the IMUL instruction used to calculate the remainder uses an immediate rather than a register as one of the operands if possible (as long as the constant is between 0 and $7FFFFFFF).
System
- Processor architecture: i386, x86_64
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Unsigned div
and mod
operations with constants will use BMI2 and see performance improvements in limited circumstances.
Relevant logs and/or screenshots
For the IMUL improvement (-CpCOREAVX2
has not been specified here), this is showcased best in the base64 unit (x86_64-win64, -O4) - the code corresponds to X mod 3
- before:
.globl BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN
BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN:
...
movl 24(%rcx),%ecx
movq $-6148914691236517205,%rax
mulq %rcx
shrq $1,%rdx
movl $3,%eax
imulq %rax,%rdx
subq %rdx,%rcx
...
What becomes imulq $3,%rdx
gets converted into a faster variant: leaq (%rdx,%rdx,2),%rdx
- after:
.globl BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN
BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN:
...
movl 24(%rcx),%ecx
movq $-6148914691236517205,%rax
mulq %rcx
shrq $1,%rdx
leaq (%rdx,%rdx,2),%rdx
subq %rdx,%rcx
...
Under -CpCOREAVX2
the adler unit shows a fair few improvements - before:
...
.Lj14:
...
movl %r10d,%ebx
movq $281539415969072,%rax
mulq %rbx
imull $65521,%edx
subl %edx,%ebx
movl %ebx,%r10d
movl %ecx,%ebx
movq $281539415969072,%rax
mulq %rbx
imull $65521,%edx
subl %edx,%ebx
movl %ebx,%ecx
...
After:
...
.Lj14:
...
movl %r10d,%ebx
movq $281539415969072,%rdx
mulx %rbx,%rsi,%rsi
imull $65521,%esi
subl %esi,%ebx
movl %ebx,%r10d
movl %ecx,%ebx
movq $281539415969072,%rdx
mulx %rbx,%rsi,%rsi
imull $65521,%esi
subl %esi,%ebx
movl %ebx,%ecx
...
A separate peephole optimisation is in development that can merge the two identical constants in %rdx. Within the same procedure, because %eax hasn't been reserved (MUL puts the lower half of the result into it), it can be used directly to store the function result, whereas before it was stored in %esi and only written to %eax right before the function exits, thus removing MOV instructions elsewhere.
Additional notes
The nice thing about MULX is that if the two destination operands are set to the same register, it will write the upper half of the result to it, discarding the lower half that is not required for the multiplicative reciprocal algorithm.