J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:func-mov into main Oct 27, 2022

Summary

This merge request contains three commits related to extending the FuncMov2Func optimisation:

FuncMov2Func now works for 1, 3 and 4-operand instructions where the last operand is a pure write and the others are all pure reads.
FuncMov2Func has been moved to a separate procedure so it can also be called in Pass 2 since other Pass 2 optimisations can open up new opportunities.
Because it is now called in Pass 2, the FuncMov2Func optimisation checks whether the end result is mov %reg,%reg, which can happen if the function is itself a MOV instruction; Pass 2 would otherwise miss this.
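
As a rough illustration of the transformation described above (not the compiler's actual implementation, which is written in Pascal and uses its own register tracking), here is a minimal Python sketch over a toy instruction model. Everything in it is an assumption made for illustration: the tuple representation, the PURE_WRITE_OPS whitelist, the func_mov2func name, and the very conservative liveness check, which also ignores sub-register aliasing such as %dl versus %rdx.

```python
# Toy model: an instruction is (mnemonic, operands) in AT&T order,
# so the destination is always the last operand.
PLAIN_MOVS = {"movb", "movw", "movl", "movq"}
# Hypothetical whitelist of instructions whose last operand is a pure write.
PURE_WRITE_OPS = PLAIN_MOVS | {"leaq", "leal", "andn", "movzwl"}


def mentions(instr, reg):
    """Conservative check: does any operand text contain the register name?"""
    return any(reg in operand for operand in instr[1])


def func_mov2func(code, live_out):
    """Fold 'op ...,%tmp ; mov %tmp,%dst' into 'op ...,%dst' until nothing changes.

    live_out is the set of registers still needed after the sequence.
    """
    code = list(code)
    changed = True
    while changed:
        changed = False
        for i in range(len(code) - 1):
            (op, args), (mov, margs) = code[i], code[i + 1]
            if op not in PURE_WRITE_OPS or mov not in PLAIN_MOVS:
                continue
            if not args or len(margs) != 2:
                continue
            tmp, dst = margs
            if not (tmp.startswith("%") and dst.startswith("%")):
                continue
            if args[-1] != tmp:
                continue
            # %tmp must not be needed once the MOV has gone.
            if tmp in live_out or any(mentions(ins, tmp) for ins in code[i + 2:]):
                continue
            new_args = args[:-1] + [dst]
            replacement = [(op, new_args)]
            # If the function was itself a MOV, the result may be mov %reg,%reg.
            if op in PLAIN_MOVS and new_args[0] == dst:
                replacement = []
            code[i:i + 2] = replacement
            changed = True
            break
    return code


chain = [
    ("movb", ["(%r8,%rax,1)", "%dl"]),
    ("movb", ["%dl", "%cl"]),
    ("movb", ["%cl", "%bl"]),
]
print(func_mov2func(chain, live_out={"%bl"}))
```

The outer loop reruns the scan after every change, which loosely mirrors why calling the optimisation again in Pass 2 can pick up opportunities that an earlier change exposed.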

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

FuncMov2Func now optimises for 1, 3 and 4-operand instructions which are immediately followed by a MOV instruction.

Relevant logs and/or screenshots

Under x86_64-win64, -O4, in the defutils unit, FuncMov2Func gets called twice in a row to collapse a dependency chain - before:

	...
.Lj925:
	leaq	TC_$CGBASE$_$INT_CGSIZE$INT64$$TCGSIZE_$$_SIZE2CGSIZE(%rip),%r8
	movb	(%r8,%rax,1),%dl
	movb	%dl,%cl
	movb	%cl,%bl
	jmp	.Lj836
	...

After:

	...
.Lj925:
	leaq	TC_$CGBASE$_$INT_CGSIZE$INT64$$TCGSIZE_$$_SIZE2CGSIZE(%rip),%r8
	movb	(%r8,%rax,1),%bl
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
	jmp	.Lj836
	...

In the System unit - before:

.section .text.n_system_$$_align$qword$qword$$qword,"ax"
	.balign 16,0x90
.globl	SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD
SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD:
.Lc162:
	movq	%rdx,%r8
	leaq	-1(%rdx),%rax
	leaq	(%rcx,%rax),%r9
	andq	%rax,%rdx
	jne	.Lj191
	andn	%r9,%rax,%rax
	movq	%rax,%rcx
# Peephole Optimization: Duplicated 1 assignment(s) and redirected jump
# Peephole Optimization: %rcx = %rax; changed to minimise pipeline stall (MovXXX2MovXXX)
# Peephole Optimization: Mov2Nop 4 done
	ret
	.p2align 4,,10
	.p2align 3
.Lj191:
	movq	%r9,%rax
	xorl	%edx,%edx
	divq	%r8
	subq	%rdx,%r9
	movq	%r9,%rcx
	movq	%rcx,%rax
.Lc163:
	ret

After:

.section .text.n_system_$$_align$qword$qword$$qword,"ax"
	.balign 16,0x90
.globl	SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD
SYSTEM_$$_ALIGN$QWORD$QWORD$$QWORD:
.Lc162:
	movq	%rdx,%r8
	leaq	-1(%rdx),%rax
	leaq	(%rcx,%rax),%r9
	andq	%rax,%rdx
	jne	.Lj191
	andn	%r9,%rax,%rcx
	movq	%rcx,%rax # <-- Registers have been swapped, but this is not yet optimised into just "andn %r9,%rax,%rax" because %rcx is still tracked as 'in use'.  A future merge request aims to correct this.
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
# Peephole Optimization: Duplicated 1 assignment(s) and redirected jump
	ret
	.p2align 4,,10
	.p2align 3
.Lj191:
	movq	%r9,%rax
	xorl	%edx,%edx
	divq	%r8
	subq	%rdx,%r9
	movq	%r9,%rax
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
.Lc163:
	ret

Additional notes

Sometimes, FuncMov2Func now does the work of SETcc/MOV -> SETcc, although there are still cases where the latter is performed instead.

Some new inefficiencies were identified thanks to this improved optimisation - for example, in the SysUtils unit:

.Lj7599:
	...
	movb	%sil,%dil
	movb	-4(%rbp),%dil
	movb	-4(%rbp),%sil

That first MOV instruction could be removed, since %dil is overwritten by the second MOV before it is ever read.
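
The condition that makes the first MOV redundant can be sketched as follows. This is only an illustration in Python over a toy (mnemonic, operands) model and is not part of this merge request; the is_dead_store name is hypothetical, and the check deliberately handles nothing beyond two adjacent plain MOVs with no register aliasing.

```python
PLAIN_MOVS = {"movb", "movw", "movl", "movq"}


def is_dead_store(first, second):
    """True if 'first' writes a register that 'second' overwrites without reading it.

    Instructions are (mnemonic, operands) in AT&T order, destination last.
    """
    op1, args1 = first
    op2, args2 = second
    if op1 not in PLAIN_MOVS or op2 not in PLAIN_MOVS:
        return False
    dst1, dst2 = args1[-1], args2[-1]
    if not dst1.startswith("%") or dst1 != dst2:
        return False
    # The overwriting instruction must not use the old value as a source.
    return all(dst1 not in src for src in args2[:-1])


print(is_dead_store(("movb", ["%sil", "%dil"]), ("movb", ["-4(%rbp)", "%dil"])))
```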

In unicodedata - before (unrelated debug messages removed for clarity):

	...
.Lj964:
	movzwl	(%r12),%ecx
	leaq	(%r12,%rcx),%rdx
	movq	%rdx,%r12
	jmp	.Lj957
	.balign 16,0x90
	...

After:

	...
.Lj964:
	movzwl	(%r12),%ecx
	leaq	(%r12,%rcx),%r12
# Peephole Optimization: Removed MOV and changed destination on previous instruction to optimise register usage (FuncMov2Func)
	jmp	.Lj957
	.balign 16,0x90
	...

In this situation, another run of Pass 2 would convert the LEA instruction into add %rcx,%r12.
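
For illustration only, the LEA-to-ADD rewrite mentioned here can be sketched in Python as below. The lea_to_add name, the flags_live parameter and the regular expression are assumptions made for this example; the conditions the compiler actually applies are not shown in this merge request. One point the sketch does model is that ADD modifies the flags while LEA does not, so the rewrite is only safe when the flags are not needed afterwards.

```python
import re

# Matches a plain "(%base,%index)" address: no displacement, no scale factor.
LEA_BASE_INDEX = re.compile(r"^\((%\w+),(%\w+)\)$")


def lea_to_add(instr, flags_live=False):
    """Rewrite 'leaq (%a,%b),%a' as 'addq %b,%a' when the flags are not in use.

    Instructions are (mnemonic, operands) in AT&T order, destination last.
    Anything that does not match is returned unchanged.
    """
    op, args = instr
    if op != "leaq" or len(args) != 2 or flags_live:
        return instr
    match = LEA_BASE_INDEX.match(args[0])
    if not match:
        return instr
    base, index = match.groups()
    dst = args[1]
    if dst == base:
        return ("addq", [index, dst])
    if dst == index:
        return ("addq", [base, dst])
    return instr


print(lea_to_add(("leaq", ["(%r12,%rcx)", "%r12"])))
```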

[x86] Extensions to FuncMov2Func optimisation
