[x86] Fixed inaccuracy in the long-range MOV optimisations (!517)

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:deepmov-inefficiency into main Oct 29, 2023

Summary

This merge request fixes a minor inconsistency in some of the long-range deep optimisations performed by OptPass1MOV. Previously, if an operand of a distant MOV instruction was changed to minimise a pipeline stall (even when the instruction is too far away for this to have any real benefit), the process would drop out prematurely and miss other potential optimisations, such as simplifying conditional jumps and removing redundant assignments.

A second commit moves the MovMov2MovMov 2 optimisation (which changes "mov [ref],reg1; mov [ref],reg2" into "mov [ref],reg1; mov reg1,reg2") into a separate routine so that it can be called during the long-range deep optimisations. This catches a one-off inefficiency: if the reference in the first MOV of the pair contains the source register and reg1 is the target register, the end result is less optimal code.
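As a rough illustration of the MovMov2MovMov 2 pattern only, here is a minimal Python sketch on a toy instruction list. It is not the compiler's routine: the tuple format, the helper names and the safety condition (skipping the rewrite when reg1 appears in the reference) are assumptions made for this example, and sub-register aliasing is not modelled.

```python
import re
from typing import List, Tuple

# Toy instruction form, AT&T operand order: (mnemonic, source, destination).
Instr = Tuple[str, str, str]


def is_mem(operand: str) -> bool:
    """Toy rule: an operand with parentheses is a memory reference."""
    return "(" in operand


def is_reg(operand: str) -> bool:
    """Toy rule: an operand starting with '%' is a plain register."""
    return operand.startswith("%")


def mov_mov_to_mov_mov_2(code: List[Instr]) -> List[Instr]:
    """Rewrite 'mov ref,reg1; mov ref,reg2' as 'mov ref,reg1; mov reg1,reg2'."""
    out = list(code)
    for i in range(len(out) - 1):
        op1, src1, dst1 = out[i]
        op2, src2, dst2 = out[i + 1]
        if (op1 == op2 == "movq"
                and is_mem(src1) and src1 == src2
                and is_reg(dst1) and is_reg(dst2)
                # Assumed safety check: the first load must not overwrite
                # a register that the reference itself depends on.
                and dst1 not in re.findall(r"%\w+", src1)):
            out[i + 1] = (op2, dst1, dst2)
    return out


pair = [("movq", "-8(%rbx)", "%rcx"), ("movq", "-8(%rbx)", "%rax")]
rewritten = mov_mov_to_mov_mov_2(pair)
# The second load is replaced by a register-to-register copy.
```

The sketch only matches textually identical references. The case this merge request targets, where the two references use different registers that hold the same value, needs the extra knowledge that the deep optimisations track, which is why the routine is made callable from there.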

System

Processor architecture: i386, x86_64

What is the current bug behavior?

Some potential optimisations under -O3 get missed.

What is the behavior after applying this patch?

More code gets successfully optimised.

Relevant logs and/or screenshots

Most units receive a change, but it is largely just a change of register to minimise the chance of a pipeline stall. In some cases, though, it permits another optimisation to take place.

Under x86_64-win64 -O4 in the ServiceManager unit - before:

	...
.section .text.n_servicemanager_$$_allocdependencylist$ansistring$$pchar,"ax"
	...
.Lc120:
.seh_stackalloc 40
.seh_endprologue
	movq	%rcx,%rbx
	movq	$0,32(%rsp)
	testq	%rcx,%rcx
	je	.Lj154
	movq	%rcx,%rsi
	testq	%rbx,%rbx
	je	.Lj155
	movq	-8(%rbx),%rsi
.Lj155:
	movslq	%esi,%rax
	...

With the inefficiency fixed, the second TEST instruction gets optimised, and this causes a cascade that removes the conditional jump, since it is guaranteed never to be taken (if %rcx = 0, the control flow will already have jumped to .Lj154) - after:

	...
.section .text.n_servicemanager_$$_allocdependencylist$ansistring$$pchar,"ax"
	...
.Lc120:
.seh_stackalloc 40
.seh_endprologue
	movq	%rcx,%rbx
	movq	$0,32(%rsp)
	testq	%rcx,%rcx
	je	.Lj154
	movq	-8(%rcx),%rsi
	movslq	%esi,%rax
	...
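
The reasoning behind this cascade can be sketched with a toy Python model. This is a hypothetical illustration, not the compiler's implementation: the tuple format and function name are invented here, flag liveness and sub-register aliasing are ignored, and all knowledge is discarded at labels and at any instruction the model does not understand. It also only removes the repeated check; the real optimiser additionally rewrites the reference and drops the dead MOV and label.

```python
from typing import Dict, List, Set, Tuple

# Toy instructions: ("movq", src, dst), ("testq", a, b), ("je", label),
# ("label", name). Operands use AT&T order.
Instr = Tuple[str, ...]


def is_reg(operand: str) -> bool:
    return operand.startswith("%")


def drop_repeated_zero_checks(code: List[Instr]) -> List[Instr]:
    alias: Dict[str, str] = {}   # copy -> register it was copied from
    nonzero: Set[str] = set()    # registers proven non-zero on this path
    out: List[Instr] = []
    i = 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (ins[0] == "testq" and len(ins) == 3 and ins[1] == ins[2]
                and is_reg(ins[1]) and nxt is not None and nxt[0] == "je"):
            root = alias.get(ins[1], ins[1])
            if root in nonzero:
                i += 2           # the jump can never be taken: drop the pair
                continue
            nonzero.add(root)    # on the fall-through path the value is non-zero
            out += [ins, nxt]
            i += 2
            continue
        if ins[0] == "movq" and len(ins) == 3:
            src, dst = ins[1], ins[2]
            if is_reg(dst):
                # dst is overwritten: forget everything known about it.
                alias.pop(dst, None)
                for reg in [r for r, s in alias.items() if s == dst]:
                    del alias[reg]
                nonzero.discard(dst)
                if is_reg(src):
                    alias[dst] = alias.get(src, src)
        else:
            alias.clear()        # labels and unknown instructions: be conservative
            nonzero.clear()
        out.append(ins)
        i += 1
    return out


before = [
    ("movq", "%rcx", "%rbx"),
    ("movq", "$0", "32(%rsp)"),
    ("testq", "%rcx", "%rcx"),
    ("je", ".Lj154"),
    ("movq", "%rcx", "%rsi"),
    ("testq", "%rbx", "%rbx"),
    ("je", ".Lj155"),
    ("movq", "-8(%rbx)", "%rsi"),
    ("label", ".Lj155"),
]
after = drop_repeated_zero_checks(before)
```

Because %rbx is recorded as a copy of %rcx, the second check resolves to the same register that the first check already proved non-zero, so the TEST/JE pair is dropped.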

The system unit has a couple of situations where redundant writes are removed - before:

.section .text.n_fpc_widestr_compare,"ax"
	...
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	...
.Lj3275:
	movq	%rcx,%r8
	testq	%rbx,%rbx
	je	.Lj3276
	movl	-4(%rbx),%r8d
	shrq	$1,%r8
.Lj3276:
	movq	%rdx,%rax
	testq	%rsi,%rsi
	je	.Lj3277
	movl	-4(%rsi),%eax
	shrq	$1,%rax
.Lj3277:
	cmpq	%r8,%rax
	cmovlq	%rax,%r8
	movq	%rsi,%rdx
	movq	%rbx,%rcx
	call	SYSTEM_$$_COMPAREWORD$formal$formal$INT64$$INT64
	...

In this case, besides the change of register from %rbx to %rcx (and from %rsi to %rdx) in many instructions, the two MOV instructions immediately prior to the CALL get removed, since the registers already contain the same values - after:

.section .text.n_fpc_widestr_compare,"ax"
	...
	movq	%rcx,%rbx
	movq	%rdx,%rsi
	...
.Lj3275:
	movq	%rcx,%r8
	testq	%rcx,%rcx
	je	.Lj3276
	movl	-4(%rcx),%r8d
	shrq	$1,%r8
.Lj3276:
	movq	%rdx,%rax
	testq	%rdx,%rdx
	je	.Lj3277
	movl	-4(%rdx),%eax
	shrq	$1,%rax
.Lj3277:
	cmpq	%r8,%rax
	cmovlq	%rax,%r8
	call	SYSTEM_$$_COMPAREWORD$formal$formal$INT64$$INT64
	...

The Strutils unit receives an optimisation similar to the one in the ServiceManager unit - before:

.section .text.n_strutils_$$_ansistartsstr$ansistring$ansistring$$boolean,"ax"
	...
	movq	%rcx,%rbx
	movq	$0,-8(%rbp)
	testq	%rcx,%rcx
	je	.Lj459
	movq	%rcx,%r8
	testq	%rbx,%rbx
	je	.Lj462
	movq	-8(%rbx),%r8
.Lj462:
	leaq	-8(%rbp),%rcx

After:

.section .text.n_strutils_$$_ansistartsstr$ansistring$ansistring$$boolean,"ax"
	...
	movq	%rcx,%rbx
	movq	$0,-8(%rbp)
	testq	%rcx,%rcx
	je	.Lj459
	movq	-8(%rcx),%r8
	leaq	-8(%rbp),%rcx
	...

Simple example in the classes unit - before:

.seh_proc CLASSES$_$TLIST_$__$$_NOTIFY$POINTER$TLISTNOTIFICATION
	...
	movq	%rcx,%rbx
	...
.Lj2467:
	movq	%rdx,%r9
	movl	$2,%r8d
	movq	%rcx,%rdx
	movq	%rbx,%rcx
	call	CLASSES$_$TLIST_$__$$_FPONOTIFYOBSERVERS$TOBJECT$TFPOBSERVEDOPERATION$POINTER
	...

The redundant copy of %rbx back into %rcx gets removed, since %rcx still holds the same value - after:

.seh_proc CLASSES$_$TLIST_$__$$_NOTIFY$POINTER$TLISTNOTIFICATION
	...
	movq	%rcx,%rbx
	...
.Lj2467:
	movq	%rdx,%r9
	movl	$2,%r8d
	movq	%rcx,%rdx
	call	CLASSES$_$TLIST_$__$$_FPONOTIFYOBSERVERS$TOBJECT$TFPOBSERVEDOPERATION$POINTER
	...
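
The idea of dropping a copy whose destination already holds the same value can be sketched as follows. This is a toy Python model under stated assumptions, not the compiler's code: the tuple format and function name are hypothetical, only straight-line `movq` instructions are tracked, and anything else (including the CALL) discards all knowledge. The `movl $2,%r8d` line and the label from the listing are left out of the toy input because sub-register aliasing and control flow are not modelled.

```python
from typing import Dict, List, Tuple

# Toy instructions, AT&T operand order, e.g. ("movq", "%rcx", "%rbx").
Instr = Tuple[str, ...]


def is_reg(operand: str) -> bool:
    return operand.startswith("%")


def drop_redundant_copies(code: List[Instr]) -> List[Instr]:
    copy_of: Dict[str, str] = {}  # register -> register it was last copied from
    out: List[Instr] = []
    for ins in code:
        if ins[0] == "movq" and len(ins) == 3:
            src, dst = ins[1], ins[2]
            if is_reg(src) and is_reg(dst) and (
                    copy_of.get(src) == dst or copy_of.get(dst) == src):
                continue  # both registers already hold the same value
            if is_reg(dst):
                # dst is overwritten: forget every fact that involves it.
                copy_of.pop(dst, None)
                for reg in [r for r, s in copy_of.items() if s == dst]:
                    del copy_of[reg]
                if is_reg(src):
                    copy_of[dst] = src
        else:
            copy_of.clear()  # unknown instruction: be conservative
        out.append(ins)
    return out


before = [
    ("movq", "%rcx", "%rbx"),
    ("movq", "%rdx", "%r9"),
    ("movq", "%rcx", "%rdx"),
    ("movq", "%rbx", "%rcx"),
    ("call", "target"),
]
after = drop_redundant_copies(before)
```

Here `movq %rcx,%rdx` only reads %rcx, so %rbx and %rcx are still known to be equal when `movq %rbx,%rcx` is reached, and that copy is skipped.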

The second commit alone is able to improve the generated assembly, for example in bin2obj - before:

	...
.section .text.n_p$bin2obj$_$writememstream_$$_writestrln$shortstring,"ax"
	...
	movq	%rcx,%rbx
	movzbl	(%rdx),%r8d
	addq	$1,%rdx
	movq	-8(%rcx),%rcx
	movq	-8(%rbx),%rax
	movq	(%rax),%rax
	call	*264(%rax)
	...
.section .text.n_p$bin2obj$_$writememstream_$$_writestr$shortstring,"ax"
	...
	movq	%rcx,%rax
	movzbl	(%rdx),%r8d
	addq	$1,%rdx
	movq	-8(%rcx),%rcx
	movq	-8(%rax),%rax
	movq	(%rax),%rax
	call	*264(%rax)
	...

With MovMov2MovMov 2 being performed as soon as possible, the instructions are made more efficient - after:

	...
.section .text.n_p$bin2obj$_$writememstream_$$_writestrln$shortstring,"ax"
	...
	movq	%rcx,%rbx
	movzbl	(%rdx),%r8d
	addq	$1,%rdx
	movq	-8(%rcx),%rcx
	movq	(%rcx),%rax
	call	*264(%rax)
	...
.section .text.n_p$bin2obj$_$writememstream_$$_writestr$shortstring,"ax"
	...
	movzbl	(%rdx),%r8d
	addq	$1,%rdx
	movq	-8(%rcx),%rcx
	movq	(%rcx),%rax
	call	*264(%rax)
	...
