[ARM / AArch64] AND/CMP -> TST optimisation and supporting code
Summary
This merge request aims to improve the intermediate assembly language produced under Arm and AArch64 by converting AND/CMP pairs into TST instructions and eliminating the temporary register used (the peephole optimiser attempts to remove the relevant allocation and deallocation entries), thereby reducing instruction count, increasing speed and permitting more future optimisations. It is split into five distinct parts:
- First, some minor refactoring and utility functions:
  - A general-purpose TryRemoveRegAlloc routine that attempts to remove the allocation and deallocation entries for a particular register between two instructions.
  - FindRegAllocBackward has been modified so it returns nil if the first tai_regalloc that it finds for the given register is a deallocation. This allows optimisations that use this function to be simplified by removing the need to check that the returned object is not a deallocation (also improving maintainability by eliminating bugs where this check might be forgotten).
  - The optimisation routines OptPostCMP and OptPostAnd for Arm and AArch64 have been renamed to PostPeepholeOptCMP and PostPeepholeOptAND for internal consistency.
- A new optimisation that converts AND/CMP into either TST or ANDS instructions, depending on whether the temporary register is used afterwards (AND/CMP to ANDS is already performed by the OpCmp2OpS optimisation, but since all of the checks have been performed by this point, it would just waste time not to change it to ANDS here).
- A new PostPeepholeOptTST routine has been introduced that aims to convert TST/B.c pairs into TBZ or TBNZ instructions. This is to compensate for the fact that AND/CMP pairs have been simplified and will no longer be picked up by PostPeepholeOptAND.
- The PostPeepholeOptAND routine has been removed since it no longer catches the optimisation it was written for (PostPeepholeOptTST and the existing PostPeepholeOptCMP fulfil these roles; see the previous point).
- A new OptPass2TST routine aims to catch a more uncommon optimisation where AND/CMP is converted into TST, but an identical AND appears after the conditional jump, thus a saving can be made if the new TST instruction is converted into ANDS (and the register that the AND instruction writes to isn't in use).
- RegInInstruction, RegModifiedByInstruction and RegLoadedWithNewValue have been upgraded on ARM and AArch64 to better handle the flags register.
- OptPass2AND and OptPass2CMP will attempt to remove vestigial CMP and TST instructions (i.e. those that no longer have conditional statements following them).
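The core AND/CMP rewrite described in the list above can be illustrated with a toy model. This is only a sketch in Python, not the compiler's Pascal code: the tuple-based instruction format, the `fold_and_cmp` name and the `live_after` set (a stand-in for the real register usage tracking) are all assumptions made for illustration, and it ignores which condition code follows the compare.

```python
from typing import List, Set, Tuple

Instr = Tuple[str, ...]  # e.g. ("and", "w0", "w0", "#1")


def fold_and_cmp(code: List[Instr], live_after: Set[str]) -> List[Instr]:
    """Fold `and rd,rn,op2` + `cmp rd,#0` into a single instruction.

    `live_after` is a hypothetical stand-in for the compiler's register
    tracking: registers whose AND result is still read after the pair.
    """
    out: List[Instr] = []
    i = 0
    while i < len(code):
        cur = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (
            cur[0] == "and"
            and nxt is not None
            and nxt[0] == "cmp"
            and nxt[1] == cur[1]      # CMP tests the AND's destination
            and nxt[2] == "#0"        # ...against zero
        ):
            rd, rn, op2 = cur[1], cur[2], cur[3]
            if rd in live_after:
                # Result still needed: keep the write, set flags too.
                out.append(("ands", rd, rn, op2))
            else:
                # Result is dead: only the flags matter.
                out.append(("tst", rn, op2))
            i += 2
            continue
        out.append(cur)
        i += 1
    return out
```

Passing an empty `live_after` corresponds to the case where the temporary register can be freed, so the pair becomes TST; if the destination is still read later, the pair becomes ANDS instead.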
Additionally, due to a particular occurrence in the Classes unit, the new AND; CMP -> TST optimisation will look out for AND; CMP; B.c; AND combinations, where the second AND has the same inputs as the first AND. The destination register can be the same or different: if it is the same, the second AND is removed completely, and if it is different, it is changed to MOV. In both cases, the first AND is changed to ANDS.
System
- Operating system: Linux (Raspberry Pi OS)
- Processor architecture: Arm, AArch64
- Device: Raspberry Pi 400
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Some conditional code is now more efficient.
Relevant logs and/or screenshots
In the cclasses unit under aarch64-linux (-O4) - before:
.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
.balign 8
.globl CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
.type CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
...
and w0,w0,#1
cmp w0,#0
cset w0,ne
b .Lj1274
After:
.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
.balign 8
.globl CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
.type CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
...
tst w0,#1
cset w0,ne
b .Lj1274
In the sysutils unit - before:
...
.Lj163:
...
and x1,x1,#1
cbz w1,.Lj162
.Lj166:
...
After - in this situation, the post-peephole optimisation is different. Originally it was and/cmp/b.eq and was changed to and/cbz, but now it's tst/b.eq (with tst made from converting and/cmp) being changed to tbz:
...
.Lj163:
...
tbz x1,#0,.Lj162
.Lj166:
...
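The condition under which a TST/B.c pair can collapse into a single test-bit branch can be sketched as follows. This is a hedged Python illustration, not the PostPeepholeOptTST routine itself: the function name and return shape are hypothetical, and it only models the mask and condition checks (the mask must have exactly one bit set, and the branch must be on eq or ne), ignoring everything else the real routine has to verify.

```python
from typing import Optional, Tuple


def tst_branch_to_test_bit(mask: int, cond: str) -> Optional[Tuple[str, int]]:
    """Return (mnemonic, bit index) if `tst reg,#mask` + `b.<cond>` can
    be replaced by one test-bit branch, otherwise None."""
    if cond not in ("eq", "ne"):
        return None  # only zero / non-zero tests map onto tbz / tbnz
    if mask <= 0 or mask & (mask - 1) != 0:
        return None  # mask does not have exactly one bit set
    bit = mask.bit_length() - 1
    # b.eq branches when the tested bit is zero, b.ne when it is set.
    return ("tbz" if cond == "eq" else "tbnz", bit)
```

A mask of 1 with b.eq gives tbz on bit 0, matching the listing above, whereas a multi-bit mask has no single bit to test and the TST/B.c pair has to stay.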
A common change that appears under aarch64-linux (-O4) in the classes unit - before:
...
.Lj42:
and x0,x20,#63
cbz x0,.Lj46
...
After:
...
.Lj42:
tst x20,#63
b.eq .Lj46
...
Though the instruction count is the same, it reduces the number of registers used.
The OptPass2TST optimisation catches a couple of instances where, after an AND/CMP/B.c triplet is converted into a TST/B.c pair, an instruction identical to the original AND appears after the conditional jump, most likely with the same destination register as the one that was deallocated. In this instance, converting the TST into an ANDS removes an additional instruction. For the classes example - before:
...
.Lj42:
and x0,x20,#63
cbz x0,.Lj46
and x0,x20,#63
...
After (without OptPass2TST):
...
.Lj42:
tst x20,#63
b.eq .Lj46
and x0,x20,#63
...
After (with OptPass2TST):
...
.Lj42:
ands x0,x20,#63
b.eq .Lj46
...
The fpttfsubsetter unit has a similar example.
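The TST/B.c/AND merge shown in the classes example can be modelled with a small sketch. As before, this is an illustrative Python toy rather than the actual OptPass2TST code: the tuple instruction format, the function name and the `in_use` set (standing in for the compiler's knowledge of which registers are live at the TST) are assumptions.

```python
from typing import List, Set, Tuple

Instr = Tuple[str, ...]


def merge_tst_with_later_and(code: List[Instr], in_use: Set[str]) -> List[Instr]:
    """Rewrite `tst rn,op2; b.c L; and rd,rn,op2` as `ands rd,rn,op2; b.c L`
    when rd is not in use at the point of the TST."""
    out: List[Instr] = []
    i = 0
    while i < len(code):
        if i + 2 < len(code):
            tst, branch, later = code[i], code[i + 1], code[i + 2]
            if (
                tst[0] == "tst"
                and branch[0].startswith("b.")
                and later[0] == "and"
                and later[2:] == tst[1:]     # identical inputs
                and later[1] not in in_use   # rd is free to be written early
            ):
                out.append(("ands",) + later[1:])
                out.append(branch)
                i += 3  # the later AND is now redundant and is dropped
                continue
        out.append(code[i])
        i += 1
    return out
```

The `in_use` check matters because the merged ANDS writes the destination register before the branch, so the write also happens on the path where the jump is taken.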
To show arm-linux some love - in the classes unit - before:
...
.Lj137:
ldr r1,[r0, #12]
ands r1,r1,#31
beq .Lj142
ldr r2,[r0, #12]
and r1,r2,#31
...
After:
...
.Lj137:
ldr r1,[r0, #12]
tst r1,#31
beq .Lj142
ldr r2,[r0, #12]
and r1,r2,#31
...
Turning ANDS into TST is a common change, and while it doesn't give a direct saving (unless you count the microwatts of power saved from not writing to a register), it can permit future optimisations. In this case, r1 will retain the value referenced by [r0, #12], so the second LDR instruction could be changed to mov r2,r1 and permit some register simplification (e.g. by changing what is now a MOV/AND combination to and r1,r1,#31).