[ARM / AArch64] AND/CMP -> TST optimisation and supporting code
Summary
This merge request aims to improve the intermediate assembly language produced under Arm and AArch64 by converting AND/CMP pairs into TST instructions and eliminating the temporary register used (the peephole optimiser attempts to remove the relevant allocation and deallocation entries). This reduces instruction count, increases speed and permits more future optimisations. It is split into five distinct parts:
- First, some minor refactoring and utility functions:
  - First, a general-purpose TryRemoveRegAlloc routine that attempts to remove the allocation and deallocation entries for a particular register between two instructions.
  - FindRegAllocBackward has been modified so it returns nil if the first tai_regalloc that it finds for the given register is a deallocation. This allows optimisations that use this function to be simplified by removing the need to check that the returned object is not a deallocation (also improving maintainability by eliminating bugs where this check might be forgotten).
  - The optimisation routines OptPostCMP and OptPostAnd for Arm and AArch64 have been renamed to PostPeepholeOptCMP and PostPeepholeOptAND for internal consistency.
- A new optimisation that converts AND/CMP into either TST or ANDS instructions, depending on whether the temporary register is used afterwards (AND/CMP to ANDS is already performed by the OpCmp2OpS optimisation, but since all of the checks have been performed by this point, it would just waste time not to change it to ANDS here).
- A new PostPeepholeOptTST routine has been introduced that aims to convert TST/B.c pairs into TBZ or TBNZ instructions. This is to compensate for the fact that AND/CMP pairs have been simplified and will no longer be picked up by PostPeepholeOptAND.
- The PostPeepholeOptAND routine has been removed since it no longer catches the optimisation it was written for (PostPeepholeOptTST and the existing PostPeepholeOptCMP fulfil these roles; see the previous point).
- A new OptPass2TST routine aims to catch a more uncommon optimisation where AND/CMP is converted into TST, but an identical AND appears after the conditional jump, thus a saving can be made if the new TST instruction is converted into ANDS (and the register that the AND instruction writes to isn't in use).
- RegInInstruction, RegModifiedByInstruction and RegLoadedWithNewValue have been upgraded on Arm and AArch64 to better handle the flags register.
- OptPass2AND and OptPass2CMP will attempt to remove vestigial CMP and TST instructions (i.e. those that no longer have conditional statements following them).
Additionally, due to a particular occurrence in the Classes unit, the new AND; CMP -> TST optimisation will look out for AND; CMP; B.c; AND combinations, where the second AND has the same inputs as the first AND. The destination register can be the same or different: if it is the same, the second AND is removed completely, and if it is different, the second AND is changed to MOV. In both cases, the first AND is changed to ANDS.
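As an illustration of the core rewrite, here is a minimal Python sketch on a toy instruction list. It is not the compiler's code: the tuple representation, the function name and_cmp_to_tst and the live_after flag (a stand-in for the real register-usage check) are all assumptions made for the example. It also skips the check on which condition codes consume the flags, which a real pass needs because CMP and TST do not set every flag identically (the carry flag can differ).

```python
from typing import List, Tuple

# Hypothetical toy encoding: ("and", "w0", "w0", "#1") means and w0,w0,#1
Instr = Tuple[str, ...]


def and_cmp_to_tst(code: List[Instr], live_after: bool) -> List[Instr]:
    """Rewrite `and rd,rn,op ; cmp rd,#0` as TST (rd dead) or ANDS (rd live).

    `live_after` is an assumed stand-in for the compiler's check on whether
    rd is still used after the CMP.
    """
    out: List[Instr] = []
    i = 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if ins[0] == "and" and nxt == ("cmp", ins[1], "#0"):
            if live_after:
                # Result still needed: keep the register write, set flags too.
                out.append(("ands",) + ins[1:])
            else:
                # Result is dead: drop the destination register entirely.
                out.append(("tst",) + ins[2:])
            i += 2  # the CMP has been absorbed
            continue
        out.append(ins)
        i += 1
    return out


if __name__ == "__main__":
    before = [("and", "w0", "w0", "#1"), ("cmp", "w0", "#0"), ("cset", "w0", "ne")]
    for ins in and_cmp_to_tst(before, live_after=False):
        print(ins[0], ",".join(ins[1:]))
```

With live_after=False the AND/CMP pair collapses to a single TST, matching the cclasses listing in the logs section; with live_after=True the sketch emits ANDS instead.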
System
- Operating system: Linux (Raspberry Pi OS)
- Processor architecture: Arm, AArch64
- Device: Raspberry Pi 400
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Some conditional code is now more efficient.
Relevant logs and/or screenshots
In the cclasses unit under aarch64-linux (-O4) - before:
.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
.balign 8
.globl CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
.type CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
...
and w0,w0,#1
cmp w0,#0
cset w0,ne
b .Lj1274
After:
.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
.balign 8
.globl CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
.type CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
...
tst w0,#1
cset w0,ne
b .Lj1274
In the sysutils unit - before:
...
.Lj163:
...
and x1,x1,#1
cbz w1,.Lj162
.Lj166:
...
After - in this situation, the post-peephole optimisation is different. Originally it was and/cmp/b.eq and was changed to and/cbz, but now it's tst/b.eq (with tst made from converting and/cmp) being changed to tbz:
...
.Lj163:
...
tbz x1,#0,.Lj162
.Lj166:
...
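The condition under which a TST/B.c pair can become TBZ or TBNZ can be sketched as follows. This is a hedged Python model, not the PostPeepholeOptTST implementation: the function name and its parameters are hypothetical, and it assumes the flags are not needed after the branch. A real pass would also have to respect the limited branch range of TBZ/TBNZ.

```python
from typing import Optional, Tuple


def tst_branch_to_tbz(reg: str, mask: int, cond: str, label: str) -> Optional[Tuple[str, ...]]:
    """Model `tst reg,#mask ; b.<cond> label` -> `tbz/tbnz reg,#bit,label`.

    Returns None when the pair cannot be expressed as a single-bit test.
    """
    # Only a mask with exactly one bit set tests a single bit.
    if mask <= 0 or mask & (mask - 1) != 0:
        return None
    bit = mask.bit_length() - 1
    if cond == "eq":
        # b.eq is taken when the masked result is zero, i.e. the bit is clear.
        return ("tbz", reg, f"#{bit}", label)
    if cond == "ne":
        # b.ne is taken when the masked result is non-zero, i.e. the bit is set.
        return ("tbnz", reg, f"#{bit}", label)
    return None


if __name__ == "__main__":
    print(tst_branch_to_tbz("x1", 1, "eq", ".Lj162"))
    print(tst_branch_to_tbz("x20", 63, "eq", ".Lj46"))
```

The first call models the sysutils case above (mask #1, b.eq). The second returns None because #63 covers six bits, so a pair like that has to stay as TST/B.c.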
A common change that appears under aarch64-linux (-O4) in the classes unit - before:
...
.Lj42:
and x0,x20,#63
cbz x0,.Lj46
...
After:
...
.Lj42:
tst x20,#63
b.eq .Lj46
...
Though the instruction count is the same, it reduces the number of registers used.
The OptPass2TST optimisation is there to catch a couple of instances where, after an AND/CMP/B.c triplet is converted into a TST/B.c pair, an instruction identical to the original AND appears after the conditional jump, most likely with the same destination register as the one that was deallocated. In this instance, converting the TST into an ANDS removes an additional instruction. For the classes example - before:
...
.Lj42:
and x0,x20,#63
cbz x0,.Lj46
and x0,x20,#63
...
After (without OptPass2TST):
...
.Lj42:
tst x20,#63
b.eq .Lj46
and x0,x20,#63
...
After (with OptPass2TST):
...
.Lj42:
ands x0,x20,#63
b.eq .Lj46
...
The fpttfsubsetter unit has a similar example.
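The shape of this rewrite can be modelled in a few lines of Python. This is only a sketch under assumptions: the tuple encoding, the name tst_to_ands and the dest_free flag (standing in for the check that the AND's destination register isn't in use) are hypothetical and do not reflect the actual OptPass2TST code.

```python
from typing import List, Tuple

# Hypothetical toy encoding: ("tst", "x20", "#63") means tst x20,#63
Instr = Tuple[str, ...]


def tst_to_ands(code: List[Instr], dest_free: bool) -> List[Instr]:
    """Model `tst rn,op ; b.c ; and rd,rn,op` -> `ands rd,rn,op ; b.c`.

    `dest_free` is an assumed stand-in for the compiler's check that rd is
    not in use where the ANDS will now write to it.
    """
    out = list(code)
    for i in range(len(out) - 2):
        tst, br, nxt = out[i], out[i + 1], out[i + 2]
        if (
            dest_free
            and tst[0] == "tst"
            and br[0].startswith("b.")
            and nxt[0] == "and"
            and nxt[2:] == tst[1:]  # same inputs as the TST
        ):
            out[i] = ("ands",) + nxt[1:]  # hoist the AND above the branch
            del out[i + 2]                # the later AND is now redundant
            break
    return out


if __name__ == "__main__":
    code = [("tst", "x20", "#63"), ("b.eq", ".Lj46"), ("and", "x0", "x20", "#63")]
    for ins in tst_to_ands(code, dest_free=True):
        print(ins[0], ",".join(ins[1:]))
```

When dest_free is False the list is returned unchanged, which corresponds to leaving the TST/B.c/AND sequence alone.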
To show arm-linux some love - in the classes unit - before:
...
.Lj137:
ldr r1,[r0, #12]
ands r1,r1,#31
beq .Lj142
ldr r2,[r0, #12]
and r1,r2,#31
...
After:
...
.Lj137:
ldr r1,[r0, #12]
tst r1,#31
beq .Lj142
ldr r2,[r0, #12]
and r1,r2,#31
...
Turning ANDS into TST is a common change, and while it doesn't give a direct saving (unless you count the microwatts of power saved from not writing to a register), it can permit future optimisations. In this case, r1 will retain the value referenced by [r0, #12], so the second LDR instruction could be changed to mov r2,r1 and permit some register simplification (e.g. by changing what is now a MOV/AND combination to and r1,r1,#31).