J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:intdiv-reciprocal-mulx into main Aug 20, 2022

Summary

This merge request upgrades the multiplicative reciprocal algorithm for unsigned integer division to use MULX (a BMI2 instruction) if permitted to do so (use the -CpCOREAVX2 compiler option). The speed gain is negligable, but it reduces register pressure since EAX doesn't have to be reserved and EDX is only read from, hence other peephole optimisations may be possible.

Additionally, for 64-bit unsigned mod operations, the IMUL instruction used to calculate the remainder uses an immediate rather than a register as one of the operands if possible (as long as the constant is between 0 and $7FFFFFFF).

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Unsigned div and mod operations with constants will use BMI2 and see performance improvements in limited circumstances.

Relevant logs and/or screenshots

For the IMUL improvement (-CpCOREAVX2 has not been specified here), this is showcased best in the base64 unit (x86_64-win64, -O4) - the code corresponds to X mod 3 - before:

.globl	BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN
BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN:
	...
	movl	24(%rcx),%ecx
	movq	$-6148914691236517205,%rax
	mulq	%rcx
	shrq	$1,%rdx
	movl	$3,%eax
	imulq	%rax,%rdx
	subq	%rdx,%rcx
	...

What becomes imulq $3,%rdx gets converted into a faster variant: leaq (%rdx,%rdx,2),%rdx - after:

.globl	BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN
BASE64$_$TBASE64ENCODINGSTREAM_$__$$_FLUSH$$BOOLEAN:
	...
	movl	24(%rcx),%ecx
	movq	$-6148914691236517205,%rax
	mulq	%rcx
	shrq	$1,%rdx
	leaq	(%rdx,%rdx,2),%rdx
	subq	%rdx,%rcx
	...

Under -CpCOREAVX2 the adler unit shows a fair few improvements - before:

	...
.Lj14:
	...
	movl	%r10d,%ebx
	movq	$281539415969072,%rax
	mulq	%rbx
	imull	$65521,%edx
	subl	%edx,%ebx
	movl	%ebx,%r10d
	movl	%ecx,%ebx
	movq	$281539415969072,%rax
	mulq	%rbx
	imull	$65521,%edx
	subl	%edx,%ebx
	movl	%ebx,%ecx
	...

After:

	...
.Lj14:
	...
	movl	%r10d,%ebx
	movq	$281539415969072,%rdx
	mulx	%rbx,%rsi,%rsi
	imull	$65521,%esi
	subl	%esi,%ebx
	movl	%ebx,%r10d
	movl	%ecx,%ebx
	movq	$281539415969072,%rdx
	mulx	%rbx,%rsi,%rsi
	imull	$65521,%esi
	subl	%esi,%ebx
	movl	%ebx,%ecx
	...

A separate peephole optimisation is in development that can merge the two identical constants in %rdx. Within the same procedure, because %eax hasn't been reserved (MUL puts the lower half of the result into it), it can be used directly to store the function result, whereas before it was stored in %esi and only written to %eax right before the function exits, thus removing MOV instructions elsewhere.

Additional notes

The nice thing about MULX is that if the two destination operands are set to the same register, it will write the upper half of the result to it, discarding the lower half that is not required for the multiplicative reciprocal algorithm.

[x86] Unsigned integer division upgrade to use BMI2 if possible