
x86_64: XMM-based block move optimisation

This optimisation, besides introducing new methods for finding and allocating a free register for the peephole optimizer to use, seeks to optimise memory-to-memory moves that go through a 64-bit general-purpose register. It does this by looking for a pair of QWord moves whose offsets differ by 8 and rewriting them to use a free XMM register instead (movdqu in the general case, or movdqa when the memory block is addressed via %rbp with an offset that is a multiple of 16, since the address is then known to be 16-byte aligned).
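
As a sketch (the registers and offsets here are purely illustrative, not real compiler output), the pattern being matched is:

	movq	16(%rdx),%rax
	movq	%rax,24(%rcx)
	movq	24(%rdx),%rax
	movq	%rax,32(%rcx)

and, assuming %xmm0 is found to be free, it is rewritten as:

	movdqu	16(%rdx),%xmm0
	movdqu	%xmm0,24(%rcx)

If the source were instead addressed via %rbp with an offset that is a multiple of 16, the load would use movdqa, as in the Math example below.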

Criteria

Confirm correct compilation along with speed gain and size reduction in third-party binaries.

Notes

Example from the Math unit - trunk:

	...
.section .text.n_math_$$_momentskewkurtosis$array_of_single$double$double$double$double$double$double,"ax"
	.balign 16,0x90
.globl	MATH_$$_MOMENTSKEWKURTOSIS$array_of_SINGLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE
MATH_$$_MOMENTSKEWKURTOSIS$array_of_SINGLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE:
.seh_proc MATH_$$_MOMENTSKEWKURTOSIS$array_of_SINGLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE
	pushq	%rbp
.seh_pushreg %rbp
	movq	%rsp,%rbp
	leaq	-64(%rsp),%rsp
.seh_stackalloc 64
.seh_endprologue
	movq	%rcx,%rax
	movq	72(%rbp),%rcx
	movq	%rcx,56(%rsp)
	movq	64(%rbp),%rcx
	movq	%rcx,48(%rsp)
	movq	56(%rbp),%rcx
	movq	%rcx,40(%rsp)
	movq	48(%rbp),%rcx
	movq	%rcx,32(%rsp)
	movq	%rax,%rcx
# Peephole Optimization: Lea2Add done
	addq	$1,%rdx
	call	MATH_$$_MOMENTSKEWKURTOSIS$PSINGLE$LONGINT$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE
	nop
	leaq	(%rbp),%rsp
	popq	%rbp
	ret
.seh_endproc
	...

This gets optimised to:

	...
.section .text.n_math_$$_momentskewkurtosis$array_of_single$double$double$double$double$double$double,"ax"
	.balign 16,0x90
.globl	MATH_$$_MOMENTSKEWKURTOSIS$array_of_SINGLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE
MATH_$$_MOMENTSKEWKURTOSIS$array_of_SINGLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE:
.seh_proc MATH_$$_MOMENTSKEWKURTOSIS$array_of_SINGLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE
	pushq	%rbp
.seh_pushreg %rbp
	movq	%rsp,%rbp
	leaq	-64(%rsp),%rsp
.seh_stackalloc 64
.seh_endprologue
# Peephole Optimization: MovMov2NopNop 6b done
# Peephole Optimization: Used %xmm0 to merge a pair of memory moves (MovMovMovMov2MovdqMovdq 2)
	movdqa	64(%rbp),%xmm0
	movdqu	%xmm0,48(%rsp)
# Peephole Optimization: Used %xmm0 to merge a pair of memory moves (MovMovMovMov2MovdqMovdq 2)
	movdqa	48(%rbp),%xmm0
	movdqu	%xmm0,32(%rsp)
# Peephole Optimization: Lea2Add done
	addq	$1,%rdx
	call	MATH_$$_MOMENTSKEWKURTOSIS$PSINGLE$LONGINT$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE$DOUBLE
	nop
	leaq	(%rbp),%rsp
	popq	%rbp
	ret
.seh_endproc
	...

Since %rcx is no longer used to transfer the data, the "movq %rcx,%rax" and "movq %rax,%rcx" instructions that preserved its value are also removed (the MovMov2NopNop optimisation noted in the listing). I don't know why the compiler didn't simply use %rax for the data transfer and leave %rcx alone, though.

Future Development

There is room for improvement. Firstly, if the peephole optimizer could confirm that the stack is 16-byte aligned (it currently cannot assume this, because a leaf function compiled without a stack frame leaves %rsp offset by 8), then movdqa could sometimes be used when writing to the stack as well. Secondly, the optimisations are highly dependent on the order of the instructions - for example, in the Classes unit:

	...
	je	.Lj5719
	movq	80(%rbp),%rcx
	movq	%rcx,64(%rsp)
	movq	72(%rbp),%rcx
	movq	%rcx,56(%rsp)
	movq	64(%rbp),%rcx
	movq	%rcx,48(%rsp)
	movw	48(%rbp),%cx
	movw	%cx,32(%rsp)
	movq	56(%rbp),%rcx
	movq	%rcx,40(%rsp)
	...

This gets optimised to:

	je	.Lj5719
# Peephole Optimization: Used %xmm0 to merge a pair of memory moves (MovMovMovMov2MovdqMovdq 2)
	movdqu	72(%rbp),%xmm0
	movdqu	%xmm0,56(%rsp)
	movq	64(%rbp),%rcx
	movq	%rcx,48(%rsp)
	movw	48(%rbp),%cx
	movw	%cx,32(%rsp)
	movq	56(%rbp),%rcx
	movq	%rcx,40(%rsp)

If the reads from 64(%rbp) and 72(%rbp) had been paired instead, then movdqa could have been used in place of movdqu for the load, since 64 is a multiple of 16. Additionally, while these operations move a record from one area of memory to another, the generated code still copies each field at its exact size, even though each field is aligned to an 8-byte boundary and, in this case, it doesn't matter what is stored between 50(%rbp)/34(%rsp) and 55(%rbp)/39(%rsp) inclusive. If, at another compiler stage, the 2-byte copy were widened to a full 64-bit read and write (there's no penalty on x64 - it might even be faster because it's the native word size), then the XMM read/write optimisation could be applied to that pair as well.
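
To illustrate (this is hypothetical output, not something the current optimizer produces): if the reads starting at the 16-byte-aligned offsets were paired and the 2-byte field copy widened to a full QWord, the Classes block above could become:

	movq	80(%rbp),%rcx
	movq	%rcx,64(%rsp)
# pairing 64(%rbp)/72(%rbp): offset 64 is a multiple of 16, so movdqa can be used for the load
	movdqa	64(%rbp),%xmm0
	movdqu	%xmm0,48(%rsp)
# after widening the movw to a movq, 48(%rbp)/56(%rbp) pair up the same way
	movdqa	48(%rbp),%xmm0
	movdqu	%xmm0,32(%rsp)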

Depending on the number of registers available, it may be faster to reorder the reads and writes so that all of the reads are performed first, in ascending address order. For example, taking the Math example above:

	movdqa	64(%rbp),%xmm0
	movdqu	%xmm0,48(%rsp)
	movdqa	48(%rbp),%xmm0
	movdqu	%xmm0,32(%rsp)

Grouping all the reads together and using another XMM register:

	movdqa	48(%rbp),%xmm0
	movdqa	64(%rbp),%xmm1
	movdqu	%xmm0,32(%rsp)
	movdqu	%xmm1,48(%rsp)

This would minimise pipeline stalls and cache misses by taking advantage of the fact that the CPU pulls more memory into its cache than is immediately requested (read-ahead), whereas reading backwards often does not benefit from this.

Finally, only XMM registers are currently used. If the target CPU supports them, there's no reason this couldn't be extended to read and write through YMM or ZMM registers when the memory block is large enough.
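
For instance (a sketch assuming AVX is available and %ymm0 is free, not current optimizer output), the two XMM copies in the Math example move the contiguous 32-byte block at 48(%rbp)..79(%rbp) to 32(%rsp)..63(%rsp), so they could collapse into a single YMM copy:

# 32-byte alignment cannot be assumed, so the unaligned form is used for both load and store
	vmovdqu	48(%rbp),%ymm0
	vmovdqu	%ymm0,32(%rsp)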
