J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:intdiv-mulx-update into main Aug 29, 2022

Summary

This merge request makes a minor improvement to !283 (merged) by moving the deallocation of EDX to just before the MOVX instructions. This permits EDX to be used as the destination of the MULX instructions and alleviate register pressure even further.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Relevant logs and/or screenshots

Once again, the Adler unit shows the most improvement (x86_64-win64, -O4) - before:

.seh_proc ADLER_$$_ADLER32$LONGWORD$PBYTE$LONGWORD$$LONGWORD
	# Register rsp allocated
	pushq	%rbx
.seh_pushreg %rbx
.Lc3:
	pushq	%rsi
.seh_pushreg %rsi
.Lc4:
.seh_endprologue
	...
	# Register ebx allocated
	movl	%r10d,%ebx
	# Register rdx allocated
	movq	$281539415969072,%rdx
	# Register esi allocated
	mulx	%rbx,%rsi,%rsi
	# Register rdx released
	imull	$65521,%esi
	subl	%esi,%ebx
	# Register esi released
	movl	%ebx,%r10d
	# Register ebx released
	# Register ebx allocated
	movl	%ecx,%ebx
	# Register rdx allocated
	movq	$281539415969072,%rdx
	# Register esi allocated
	mulx	%rbx,%rsi,%rsi
	# Register rdx released
	imull	$65521,%esi
	subl	%esi,%ebx
	# Register esi released
	movl	%ebx,%ecx
	# Register ebx released
	...
.Lj3:
	# Register rsi allocated
	popq	%rsi
.Lc5:
	# Register rbx allocated
	popq	%rbx
.Lc6:
	# Register rsp released
	ret

Due to RDX being used as the destination, RSI is not used at all and doesn't have to be preserved with PUSH/POP - after:

	...
	# Register ebx allocated
	movl	%r10d,%ebx
	# Register rdx allocated
	movq	$281539415969072,%rdx
	mulx	%rbx,%rdx,%rdx
	imull	$65521,%edx
	subl	%edx,%ebx
	# Register edx released
	movl	%ebx,%r10d
	# Register ebx released
	# Register ebx allocated
	movl	%ecx,%ebx
	# Register rdx allocated
	movq	$281539415969072,%rdx
	mulx	%rbx,%rdx,%rdx
	imull	$65521,%edx
	subl	%edx,%ebx
	# Register edx released
	movl	%ebx,%ecx
	# Register ebx released
	...
.Lj3:
	# Register rbx allocated
	popq	%rbx
.Lc4:
	# Register rsp released
	ret

In the Classes unit, a MOV gets removed - before:

CLASSES$_$TBINARYOBJECTWRITER_$__$$_WRITECURRENCY$CURRENCY:
	...
	movq	$3777893186295716171,%rdx
	# Register rax allocated
	mulx	%rsi,%rax,%rax
	# Register rsi,rdx released
	shrq	$11,%rax
	# Register rdx allocated
	movq	%rax,%rdx
	# Register rax released
	...

After:

CLASSES$_$TBINARYOBJECTWRITER_$__$$_WRITECURRENCY$CURRENCY:
	...
	movq	$3777893186295716171,%rdx
	mulx	%rsi,%rdx,%rdx
	# Register rsi released
	shrq	$11,%rdx
	...

Unfortunately it's not perfect. In infodwrf, an extra MOV is added, although thanks to multiple ALUs, it shouldn't cause a slowdown - before:

LNFODWRF_$$_GETLINEINFO$QWORD$SHORTSTRING$SHORTSTRING$LONGINT$$BOOLEAN:
	...
	movb	$0,64(%rsp)
	# Register rdx allocated
	movq	$81422607485721415,%rdx
	# Register rax allocated
	mulx	%rcx,%rax,%rax
	# Register rdx released
	# Register rflags allocated
	addq	%rcx,%rax
	rcrq	$1,%rax
	# Register rflags released
	shrq	$10,%rax
	imulq	$2039,%rax
	subq	%rax,%rcx
	# Register rax released
	# Register r15 allocated
	movq	%rcx,%r15
	# Register rdx allocated
	imulq	$528,%rcx,%rdx
	# Register rax allocated
	leaq	U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
	# Register rflags allocated
	cmpq	(%rax,%rdx),%r12 ; <-- %r12 gets set to %rcx at the start of the procedure
	# Register rax,rdx released
	jne	.Lj329
	# Register rflags released
	# Register rdx allocated
	imulq	$528,%rcx,%rdx
	# Register rax allocated
	# Register rcx released
	leaq	U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
	...

After (fixing this one would require somehow hinting the register allocator that it should use %rcx instead of %rax somehow):

LNFODWRF_$$_GETLINEINFO$QWORD$SHORTSTRING$SHORTSTRING$LONGINT$$BOOLEAN:
	...
	movb	$0,64(%rsp)
	# Register rax allocated
	movq	%rcx,%rax ; <-- The extra MOV
	# Register rdx allocated
	movq	$81422607485721415,%rdx
	mulx	%rcx,%rdx,%rdx
	# Register rflags allocated
	addq	%rcx,%rdx
	rcrq	$1,%rdx
	# Register rflags released
	shrq	$10,%rdx
	imulq	$2039,%rdx
	subq	%rdx,%rax
	# Register rdx released
	# Register r15 allocated
	movq	%rax,%r15
	# Register rdx allocated
	imulq	$528,%rax,%rdx
	leaq	U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
	# Register rflags allocated
	cmpq	(%rax,%rdx),%rcx ; <-- Here, DeepMOVOpt changes %r12 to %rcx because %rcx hasn't changed value
	# Register rax,rdx,rcx released
	jne	.Lj329
	# Register rflags released
	# Register rdx allocated
	imulq	$528,%r15,%rdx
	# Register rax allocated
	leaq	U_$LNFODWRF_$$_LINEINFOCACHE(%rip),%rax
	...

Additional notes

Because of the reuse of RDX, the improved optimisation in !285 is unable to remove the repeated constant. It might be possible to do some advanced peephole optimisations that take place before register allocation though so RDX gets preserved.

Additionally, there are situations where a register is copied, worked on and its result copied back into the original register; in this case we might as well just work with the original register (e.g. in the Adler example, both %r10d and %ecx are copied into %ebx in turn, and then after %ebx has been transmuted, the results are written back into %r10d and %ecx respectively. The peephole optimizer is not (yet) able to optimise this via DeepMOVOpt because, for all it knows, the upper half of the 64-bit versions, %r10 and %rcx, may be non-zero (they're not in this case) and hence cause the '''MULX''' instructions to return different results. This requires additional information that the peephole optimizer can access, or better node-level optimisation, to alleviate.

[x86] Minor improvement to BMI2-based integer division