
[Cross-platform] New register allocation information for subroutine calls

Summary

This merge request introduces two new register allocation types for the tai_regalloc class, named "ra_actualparam" and "ra_trashed", to aid peephole optimisation around subroutine calls. tcgcallnode.pass_generate_code has been updated to generate these hints on either side of procedure calls.

To showcase this, in the x86 peephole optimizer, the RegReadByInstruction, RegInInstruction, RegModifiedByInstruction and RegLoadedWithNewValue methods have been adapted to examine the register allocation hints on either side of CALL instructions using two new helper methods: RegUsedByCall and RegTrashedByCall. These helper methods could, in principle, be made cross-platform, but are currently x86-only because individual architectures can have wildly different methods of calling subroutines.

  • RegUsedByCall - returns True if the specified register either appears in the reference (when calling through a procedural variable, for example) or is listed as an actual parameter just prior to the instruction (raises an internal error if the instruction is not CALL).
  • RegTrashedByCall - returns True if the register is listed as trashed just after the instruction (also raises an internal error if the instruction is not CALL).
  • For CALL instructions on x86 only:
    • RegReadByInstruction - returns True if RegUsedByCall returns True, otherwise False.
    • RegInInstruction - returns True if either RegUsedByCall or RegTrashedByCall returns True, otherwise False.
    • RegModifiedByInstruction - returns True if RegTrashedByCall returns True (this includes return values), otherwise False.
    • RegLoadedWithNewValue - returns True if and only if RegTrashedByCall returns True and RegUsedByCall returns False, implying that the register's new value doesn't depend on its old one.
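The boolean relationships above can be sketched as a small model (Python, purely illustrative - the real methods are Pascal and take instruction and register arguments; only the combining logic is shown):

```python
# Illustrative model of how the four query methods combine the results
# of RegUsedByCall and RegTrashedByCall for a CALL instruction.
def reg_read_by_instruction(used_by_call, trashed_by_call):
    return used_by_call

def reg_in_instruction(used_by_call, trashed_by_call):
    return used_by_call or trashed_by_call

def reg_modified_by_instruction(used_by_call, trashed_by_call):
    return trashed_by_call

def reg_loaded_with_new_value(used_by_call, trashed_by_call):
    # True only when the register is overwritten without its old value
    # being consumed, i.e. the new value doesn't depend on the old one.
    return trashed_by_call and not used_by_call
```

For example, a register that is trashed but not used as a parameter yields True from RegLoadedWithNewValue, which is what allows an earlier write to it to be treated as dead.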

As a result, CALL instructions can essentially be skipped over by GetNextInstructionUsingReg and the like if the current register is one that is preserved, allowing more long-range optimisations to take place.

Note that the peephole optimizer itself makes no assumptions about the calling convention - all of the information regarding used and trashed registers is provided at the pass_generate_code stage, including, for example, whether the function result is actually used. All the peephole optimizer sees is a list of registers that are trashed after the call (which includes the function result) and those used as actual parameters (registers that don't carry a parameter aren't listed - for example, under the Microsoft ABI, if a function only has 2 parameters, only %rcx and %rdx are marked as actual parameters, not %r8 and %r9).
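The Microsoft ABI example can be modelled as follows (a hypothetical Python sketch; the register list follows the Microsoft x64 convention for integer parameters, and the function name is invented for illustration):

```python
# Which registers would carry an "ra_actualparam" hint under the
# Microsoft x64 calling convention for a call with n integer parameters.
WIN64_INT_PARAM_REGS = ['%rcx', '%rdx', '%r8', '%r9']

def actual_param_hints(param_count):
    # Only registers that actually carry a parameter are marked; unused
    # parameter registers (e.g. %r8/%r9 for a 2-argument call) are not.
    return WIN64_INT_PARAM_REGS[:param_count]
```

So a 2-parameter call marks only %rcx and %rdx, and the optimizer is free to treat %r8 and %r9 as not read by the call.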

The main drawback is that compiled programs may now be more vulnerable to subroutines containing malformed assembly language that, say, writes to registers that haven't been preserved properly, but this would have adverse effects in any context and is the fault of the third party. Additionally, the peephole optimizer will run slightly slower because of the additional ra_actualparam and ra_trashed entries for each applicable register on either side of CALL instructions.

System

  • Processor architecture: All (currently implemented for i386 and x86_64 only)

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

  • On i386 and x86_64, peephole optimisations around CALL instructions are improved.
  • On other processors there is currently no change, but support can easily be introduced.

Relevant logs and/or screenshots

Virtually every source file sees an improvement. The most common improvement is seen around calls to object methods, which appear hundreds of times in the compiler, RTL and packages. To use aasmcnst as an example (x86_64-win64, -O4), before:

.section .text.n_aasmcnst$_$ttai_typedconstbuilder_$__$$_finalize_asmlist_add_indirect_sym$hgddo9i486zn,"ax"
	...
# Peephole Optimization: MovMov2MovMov 2
	movq	U_$SYMDEF_$$_CPOINTERDEF(%rip),%rcx
	movq	%rcx,%rax
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
	call	*448(%rcx)
	...

The peephole optimizer is able to change the register in the call reference from %rax to %rcx, but because it doesn't have any information about how the call uses or affects %rax, it cannot optimise any further. However, with the new allocation hints... after:

# Peephole Optimization: MovMov2MovMov 2
	movq	U_$SYMDEF_$$_CPOINTERDEF(%rip),%rcx
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
	call	*448(%rcx)

The MOV gets culled completely, since the peephole optimizer now has enough information to know that %rax gets trashed. Using the -ar option to show the hints:

	# Register rcx allocated
# Peephole Optimization: MovMov2MovMov 2
	movq	U_$SYMDEF_$$_CPOINTERDEF(%rip),%rcx
	# Register rax allocated
# Peephole Optimization: Mov2Nop 3 done
	# Register rdx parameter
	# Register r8,r9,r10,r11 allocated
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
	call	*448(%rcx)
	# Register rax,rdx,r8,r9,r10,r11 trashed
	# Register rdx,r8,r9,r10,r11 released
	# Register rdi allocated
	# Register rcx released

Due to the workings of the compiler, the hidden first parameter isn't included, but "Register rdx parameter" indicates that %rdx is being used as an actual parameter (corresponding to a formal parameter of what is the DefineAsmSymbol method here), and the call trashes %rax, %rdx, %r8, %r9, %r10 and %r11... essentially, all of the volatile registers plus the one used for the function result. The difference with the function result is that it doesn't get deallocated.
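The distinction between trashing and releasing can be expressed as a tiny model (illustrative Python only, not the compiler's data structures):

```python
# Rough model of the hint lists around the call above: the volatile
# registers plus the function result are trashed, but the result
# register (%rax here) is not released afterwards.
trashed  = {'rax', 'rdx', 'r8', 'r9', 'r10', 'r11'}
released = trashed - {'rax'}    # %rax stays allocated as the result
```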

In the system unit, a similar optimisation takes place, but slightly earlier - before:

	...
.Lj3797:
	movw	%di,%r12w
	movq	%rsi,%rcx
	movq	%rbx,%rdx
	movq	%r13,%r9
# Peephole Optimization: %r12w = %di; changed to minimise pipeline stall (MovXXX2MovXXX)
	movzwl	%di,%r8d
	call	*U_$SYSTEM_$$_WIDESTRINGMANAGER(%rip)
	...

Since other peephole optimisations have removed the use of %r12, the initial MOV that writes to it can also be removed because the peephole optimizer now knows that %r12 is preserved by the call, but is also not used afterwards - after:

	...
.Lj3797:
# Peephole Optimization: Mov2Nop 3b done
	movq	%rsi,%rcx
	movq	%rbx,%rdx
	movq	%r13,%r9
# Peephole Optimization: %r12w = %di; changed to minimise pipeline stall (MovXXX2MovXXX)
	movzwl	%di,%r8d
	call	*U_$SYSTEM_$$_WIDESTRINGMANAGER(%rip)
	...
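The liveness reasoning behind removing that MOV can be sketched roughly as follows (a Python model with deliberately crude instruction handling; the helper names echo, but are not, the real Pascal ones):

```python
# Sketch of the scan that culls "movw %di,%r12w": with the new hints, a
# CALL that neither uses nor trashes a register can be stepped over.
def store_is_dead(reg, later, used_by_call, trashed_by_call):
    for instr in later:
        if instr == 'call':
            if used_by_call(reg):
                return False   # call reads the register: store is live
            if trashed_by_call(reg):
                return True    # overwritten before any read: store is dead
            continue           # preserved by the call: keep scanning
        if reg in instr:       # crude "mentions the register" check
            return False
    return True                # never read again: the store can go
```

Here %r12 is neither a parameter to the call nor trashed by it, and nothing after the call reads it, so the movw to %r12w is removed.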

In sysutils, the TEST/Jcc/TEST optimisations can get some more work done - before:

	...
.Lj202:
	testl	%edi,%edi
# Peephole Optimization: TEST/Jcc/@Lbl/TEST/Jcc -> TEST/Jcc, redirecting first jump
	jng	.Lj208
	movslq	%edi,%rax
	movzwl	-2(%rsi,%rax,2),%ecx
	leaq	32(%rsp),%rdx
	call	SYSUTILS_$$_CHARINSET$WIDECHAR$TSYSCHARSET$$BOOLEAN
	testb	%al,%al
	je	.Lj201
	testl	%edi,%edi
	jng	.Lj208
	movslq	%edi,%r9
	movq	%rbx,%rcx
	...

After:

	...
.Lj202:
	testl	%edi,%edi
# Peephole Optimization: TEST/Jcc/@Lbl/TEST/Jcc -> TEST/Jcc, redirecting first jump
	jng	.Lj208
	movslq	%edi,%rax
	movzwl	-2(%rsi,%rax,2),%ecx
	leaq	32(%rsp),%rdx
	call	SYSUTILS_$$_CHARINSET$WIDECHAR$TSYSCHARSET$$BOOLEAN
	testb	%al,%al
	je	.Lj201
# Peephole Optimization: TEST/Jcc/TEST; removed superfluous TEST
# Peephole Optimization: Removed dominated jump (via TEST/Jcc/TEST)
	movslq	%edi,%r9
	movq	%rbx,%rcx
	...
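The TEST/Jcc/TEST reasoning above relies on the same hints: if nothing between two identical TESTs modifies the tested register (and the call is known to preserve %edi), the second TEST would set identical flags, so it and the dominated jump can go. A rough Python sketch (illustrative only, with a crude AT&T-style destination check):

```python
def writes(instr, reg):
    # Crude destination check: AT&T syntax puts the destination last.
    return instr.split(',')[-1].strip() == reg

def second_test_superfluous(reg, between, trashed_by_call):
    # True when no instruction between the two TESTs can change reg,
    # so a repeated TEST on it would produce the same flags.
    for instr in between:
        if instr == 'call':
            if trashed_by_call(reg):
                return False   # the call may change the register
            continue           # preserved across the call
        if writes(instr, reg):
            return False
    return True
```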

In the same unit, Pass 2 is a little more efficient, removing a superfluous ADD instruction - before:

	...
.Lj2200:
	...
# Peephole Optimization: Lea2Add done
	leaq	1(%rax),%r12
# Peephole Optimization: AddMov2LeaAdd
	leaq	1(%rax),%rcx
# Peephole Optimization: AddMov2LeaAdd
	addq	$1,%rax
	leaq	-1(%rdi),%r8
	leaq	1(%rsi),%rdx
# Peephole Optimization: %r12 = %rax; changed to minimise pipeline stall (MovMov2Mov 6a)
	call	SYSTEM_$$_COMPARECHAR$formal$formal$INT64$$INT64
	...

After (note that the write to %r12 isn't removed here, despite the pipeline stall optimisation, because it is used after the call):

	...
.Lj2200:
	...
# Peephole Optimization: Lea2Add done
	leaq	1(%rax),%r12
# Peephole Optimization: AddMov2LeaAdd
	leaq	-1(%rdi),%r8
	leaq	1(%rsi),%rdx
# Peephole Optimization: %r12 = %rax; changed to minimise pipeline stall (MovMov2Mov 6a)
# Peephole Optimization: AddMov2Lea
	leaq	1(%rax),%rcx
	call	SYSTEM_$$_COMPARECHAR$formal$formal$INT64$$INT64
	...

Also in sysutils, a different TEST/Jcc/TEST optimisation takes place - before:

	...
.Lj2652:
# Peephole Optimization: Cmp1Jl2Cmp0Jle
	testq	%r14,%r14
# Peephole Optimization: CMP/Jcc/@Lbl/CMP/Jcc -> CMP/Jcc, redirecting first jump
	jle	.Lj2657
	movq	(%rbx),%rax
	movzbl	-1(%rax,%r14,1),%ecx
	movq	%rdi,%rdx
	movq	%r12,%r8
	call	SYSUTILS_$$_HAVECHAR$ANSICHAR$array_of_ANSICHAR$$BOOLEAN
	testb	%al,%al
	jne	.Lj2651
# Peephole Optimization: Cmp1Jl2Cmp0Jle
	testq	%r14,%r14
	jnle	.Lj2658
.Lj2657:
	...

After:

	...
.Lj2652:
# Peephole Optimization: Cmp1Jl2Cmp0Jle
	testq	%r14,%r14
# Peephole Optimization: CMP/Jcc/@Lbl/CMP/Jcc -> CMP/Jcc, redirecting first jump
	jle	.Lj2657
	movq	(%rbx),%rax
	movzbl	-1(%rax,%r14,1),%ecx
	movq	%rdi,%rdx
	movq	%r12,%r8
	call	SYSUTILS_$$_HAVECHAR$ANSICHAR$array_of_ANSICHAR$$BOOLEAN
	testb	%al,%al
	jne	.Lj2651
# Peephole Optimization: Cmp1Jl2Cmp0Jle
# Peephole Optimization: TEST/Jcc/TEST; removed superfluous TEST
# Peephole Optimization: Conditional jump will always branch (via TEST/Jcc/TEST)
	jmp	.Lj2658
.Lj2657:
	...
Edited by J. Gareth "Kit" Moreton
