[Cross-platform] New register allocation information for subroutine calls
Summary
This merge request introduces two new register allocation types for the `tai_regalloc` class, named `ra_actualparam` and `ra_trashed`, to aid peephole optimisation around subroutine calls. `tcgcallnode.pass_generate_code` has been updated to generate these hints on either side of procedure calls.
To showcase this, in the x86 peephole optimizer, the `RegReadByInstruction`, `RegInInstruction`, `RegModifiedByInstruction` and `RegLoadedWithNewValue` methods have been adapted to study the register allocation hints on either side of `CALL` instructions using two new helper methods: `RegUsedByCall` and `RegTrashedByCall`. These helper methods could, in principle, be made cross-platform, but are currently x86-only because individual architectures can have wildly different methods for calling subroutines.
- `RegUsedByCall` - returns True if the specified register either appears in the reference (when calling through a procedural variable, for example) or is listed as an actual parameter just prior to the instruction (raises an internal error if the instruction is not `CALL`).
- `RegTrashedByCall` - returns True if the register is listed as trashed just after the instruction (also raises an internal error if the instruction is not `CALL`).
- For `CALL` instructions only on x86:
  - `RegReadByInstruction` - returns True if `RegUsedByCall` returns True, otherwise False.
  - `RegInInstruction` - returns True if either `RegUsedByCall` or `RegTrashedByCall` returns True, otherwise False.
  - `RegModifiedByInstruction` - returns True if `RegTrashedByCall` returns True (this includes return values), otherwise False.
  - `RegLoadedWithNewValue` - returns True if and only if `RegTrashedByCall` returns True and `RegUsedByCall` returns False, thus implying that the register's new value doesn't depend on its old one.
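The dispatch above reduces to a small truth table over the two helper queries. As a minimal sketch (a Python model for illustration only, not FPC code - the real methods are Pascal and take an instruction operand rather than booleans):

```python
# Model of how the four predicates answer for a CALL instruction,
# derived purely from the two helper queries described above.

def reg_read_by_call(used: bool, trashed: bool) -> bool:
    # RegReadByInstruction for CALL: True iff RegUsedByCall is True
    return used

def reg_in_call(used: bool, trashed: bool) -> bool:
    # RegInInstruction for CALL: register is a parameter/reference OR is trashed
    return used or trashed

def reg_modified_by_call(used: bool, trashed: bool) -> bool:
    # RegModifiedByInstruction for CALL: True iff trashed (includes return values)
    return trashed

def reg_loaded_with_new_value(used: bool, trashed: bool) -> bool:
    # RegLoadedWithNewValue for CALL: trashed and not read, so the new
    # value cannot depend on the old one
    return trashed and not used
```

The key case is the last one: a register that is trashed but not used (e.g. a volatile register carrying no parameter) is known to receive a value independent of its old contents, which is what lets earlier writes to it be culled.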
As a result, `CALL` instructions can essentially be skipped over by `GetNextInstructionUsingReg` and the like if the current register is one that is preserved, allowing more long-range optimisations to take place.
Note that the peephole optimizer itself makes no assumptions about the calling convention - all of the information regarding used and trashed registers is provided at the `pass_generate_code` stage, including, for example, whether the function result is actually used. All the peephole optimizer sees is a list of registers that are trashed after the call (which includes the function result) and a list of those used as actual parameters (registers that don't carry a parameter aren't listed - for example, under the Microsoft ABI, if a function only has 2 parameters, only %rcx and %rdx are marked as actual parameters, not %r8 and %r9).
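The parameter marking can be sketched as follows (an illustrative Python model, not FPC code; it covers only the four Microsoft x64 integer argument registers and ignores floating-point and stack parameters):

```python
# Which integer registers would carry the ra_actualparam hint under the
# Microsoft x64 ABI, given how many register parameters a call passes.
# Integer arguments go in %rcx, %rdx, %r8, %r9, in that order.

WIN64_INT_PARAM_REGS = ["%rcx", "%rdx", "%r8", "%r9"]

def actual_param_regs(num_params: int) -> list:
    # Only registers that really carry a parameter are hinted; the rest
    # are simply absent from the allocation information.
    return WIN64_INT_PARAM_REGS[:min(num_params, len(WIN64_INT_PARAM_REGS))]
```

So a two-parameter call yields hints for %rcx and %rdx only, matching the example above.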
The main drawback is that compiled programs may now be more vulnerable to subroutines containing malformed assembly language that, say, writes to registers that haven't been preserved properly, but such code has adverse effects in all contexts and is the fault of the third party. Additionally, the peephole optimizer will run slightly slower because of the additional `ra_actualparam` and `ra_trashed` entries for each applicable register on either side of `CALL` instructions.
System
- Processor architecture: All (currently implemented for i386 and x86_64 only)
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
- On i386 and x86_64, peephole optimisations around `CALL` instructions are improved.
- On other processors, currently there is no change, but support can be easily introduced.
Relevant logs and/or screenshots
Virtually every source file sees an improvement. The most common improvement appears around calls to object methods, which occur literally hundreds of times in the compiler, RTL and packages. To use `aasmcnst` as an example (x86_64-win64, -O4), before:
.section .text.n_aasmcnst$_$ttai_typedconstbuilder_$__$$_finalize_asmlist_add_indirect_sym$hgddo9i486zn,"ax"
...
# Peephole Optimization: MovMov2MovMov 2
movq U_$SYMDEF_$$_CPOINTERDEF(%rip),%rcx
movq %rcx,%rax
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
call *448(%rcx)
...
The peephole optimizer is able to change the register in the call reference from `%rax` to `%rcx`, but because it has no information about how the call uses or affects `%rax`, it cannot optimise any further. However, with the new allocation hints... after:
# Peephole Optimization: MovMov2MovMov 2
movq U_$SYMDEF_$$_CPOINTERDEF(%rip),%rcx
# Peephole Optimization: Mov2Nop 3 done
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
call *448(%rcx)
The `MOV` gets culled completely, since the peephole optimizer now has enough information to know that `%rax` gets trashed. Using the `-ar` option to show the hints:
# Register rcx allocated
# Peephole Optimization: MovMov2MovMov 2
movq U_$SYMDEF_$$_CPOINTERDEF(%rip),%rcx
# Register rax allocated
# Peephole Optimization: Mov2Nop 3 done
# Register rdx parameter
# Register r8,r9,r10,r11 allocated
# Peephole Optimization: %rax = %rcx; changed to minimise pipeline stall (MovXXX2MovXXX)
call *448(%rcx)
# Register rax,rdx,r8,r9,r10,r11 trashed
# Register rdx,rdx,r8,r9,r10,r11 released
# Register rdi allocated
# Register rcx released
Due to the workings of the compiler, the hidden first parameter isn't included, but "Register rdx parameter" indicates that %rdx is being used as an actual parameter (and a formal parameter in what is the `DefineAsmSymbol` method), and that the call trashes %rax, %rdx, %r8, %r9, %r10 and %r11... essentially, all of the volatile registers and those used for the function result. The difference with the function result is that it doesn't get deallocated.
In the `system` unit, a similar optimisation takes place, but slightly earlier - before:
...
.Lj3797:
movw %di,%r12w
movq %rsi,%rcx
movq %rbx,%rdx
movq %r13,%r9
# Peephole Optimization: %r12w = %di; changed to minimise pipeline stall (MovXXX2MovXXX)
movzwl %di,%r8d
call *U_$SYSTEM_$$_WIDESTRINGMANAGER(%rip)
...
Since other peephole optimisations have removed the use of %r12, the initial `MOV` that writes to it can also be removed, because the peephole optimizer now knows that %r12 is preserved by the call but is also not used afterwards - after:
...
.Lj3797:
# Peephole Optimization: Mov2Nop 3b done
movq %rsi,%rcx
movq %rbx,%rdx
movq %r13,%r9
# Peephole Optimization: %r12w = %di; changed to minimise pipeline stall (MovXXX2MovXXX)
movzwl %di,%r8d
call *U_$SYSTEM_$$_WIDESTRINGMANAGER(%rip)
...
In `sysutils`, the `TEST/Jcc/TEST` optimisations can get some more work done - before:
...
.Lj202:
testl %edi,%edi
# Peephole Optimization: TEST/Jcc/@Lbl/TEST/Jcc -> TEST/Jcc, redirecting first jump
jng .Lj208
movslq %edi,%rax
movzwl -2(%rsi,%rax,2),%ecx
leaq 32(%rsp),%rdx
call SYSUTILS_$$_CHARINSET$WIDECHAR$TSYSCHARSET$$BOOLEAN
testb %al,%al
je .Lj201
testl %edi,%edi
jng .Lj208
movslq %edi,%r9
movq %rbx,%rcx
...
After:
...
.Lj202:
testl %edi,%edi
# Peephole Optimization: TEST/Jcc/@Lbl/TEST/Jcc -> TEST/Jcc, redirecting first jump
jng .Lj208
movslq %edi,%rax
movzwl -2(%rsi,%rax,2),%ecx
leaq 32(%rsp),%rdx
call SYSUTILS_$$_CHARINSET$WIDECHAR$TSYSCHARSET$$BOOLEAN
testb %al,%al
je .Lj201
# Peephole Optimization: TEST/Jcc/TEST; removed superfluous TEST
# Peephole Optimization: Removed dominated jump (via TEST/Jcc/TEST)
movslq %edi,%r9
movq %rbx,%rcx
...
In the same unit, Pass 2 is a little more efficient, removing a superfluous `ADD` instruction - before:
...
.Lj2200
...
# Peephole Optimization: Lea2Add done
leaq 1(%rax),%r12
# Peephole Optimization: AddMov2LeaAdd
leaq 1(%rax),%rcx
# Peephole Optimization: AddMov2LeaAdd
addq $1,%rax
leaq -1(%rdi),%r8
leaq 1(%rsi),%rdx
# Peephole Optimization: %r12 = %rax; changed to minimise pipeline stall (MovMov2Mov 6a}
call SYSTEM_$$_COMPARECHAR$formal$formal$INT64$$INT64
...
After (note that the write to %r12 isn't removed here, despite the pipeline stall optimisation, because it is used after the call):
...
.Lj2200
...
# Peephole Optimization: Lea2Add done
leaq 1(%rax),%r12
# Peephole Optimization: AddMov2LeaAdd
leaq -1(%rdi),%r8
leaq 1(%rsi),%rdx
# Peephole Optimization: %r12 = %rax; changed to minimise pipeline stall (MovMov2Mov 6a}
# Peephole Optimization: AddMov2Lea
leaq 1(%rax),%rcx
call SYSTEM_$$_COMPARECHAR$formal$formal$INT64$$INT64
...
Also in `sysutils`, a different `TEST/Jcc/TEST` optimisation takes place - before:
...
.Lj2652:
# Peephole Optimization: Cmp1Jl2Cmp0Jle
testq %r14,%r14
# Peephole Optimization: CMP/Jcc/@Lbl/CMP/Jcc -> CMP/Jcc, redirecting first jump
jle .Lj2657
movq (%rbx),%rax
movzbl -1(%rax,%r14,1),%ecx
movq %rdi,%rdx
movq %r12,%r8
call SYSUTILS_$$_HAVECHAR$ANSICHAR$array_of_ANSICHAR$$BOOLEAN
testb %al,%al
jne .Lj2651
# Peephole Optimization: Cmp1Jl2Cmp0Jle
testq %r14,%r14
jnle .Lj2658
.Lj2657:
...
After:
...
.Lj2652:
# Peephole Optimization: Cmp1Jl2Cmp0Jle
testq %r14,%r14
# Peephole Optimization: CMP/Jcc/@Lbl/CMP/Jcc -> CMP/Jcc, redirecting first jump
jle .Lj2657
movq (%rbx),%rax
movzbl -1(%rax,%r14,1),%ecx
movq %rdi,%rdx
movq %r12,%r8
call SYSUTILS_$$_HAVECHAR$ANSICHAR$array_of_ANSICHAR$$BOOLEAN
testb %al,%al
jne .Lj2651
# Peephole Optimization: Cmp1Jl2Cmp0Jle
# Peephole Optimization: TEST/Jcc/TEST; removed superfluous TEST
# Peephole Optimization: Conditional jump will always branch (via TEST/Jcc/TEST)
jmp .Lj2658
.Lj2657:
...