[AArch64] CSEL (conditional MOV) generation
Summary
This merge request, based on the x86 CMOV
overhaul over at !532 (merged), upgrades the AArch64 peephole optimizer to generate CSEL
instructions to reduce branching and generate much more efficient code (which is also generally smaller as well - for comparison, the aarch64-linux compiler built under "-O4 -dDEBUG_AOPTCPU" is 4,507,976 bytes in size on the trunk, and 4,487,752 bytes with this patch... that's considering the larger code base that comes with this new optimisation).
System
- Operating system: Linux (Raspberry Pi OS) and others
- Processor architecture: AArch64
- Device: Raspberry Pi and others
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Many conditional branches in the generated assembly language will be made shorter and faster via the use of CSEL
instructions instead of branches.
Relevant logs and/or screenshots
Under aarch64-linux -O4, even small units receive a gain, and it's probably easier to list the units that don't get changed!
To list some simple examples in the math
unit, let us begin with the 64-bit Max
function (the 32-bit version seems to have a CSEL
function generated at a higher stage... something to research at a later date) - before:
.section .text.n_math_$$_min$int64$int64$$int64,"ax"
.balign 8
.globl MATH_$$_MIN$INT64$INT64$$INT64
.type MATH_$$_MIN$INT64$INT64$$INT64,@function
MATH_$$_MIN$INT64$INT64$$INT64:
.Lc618:
cmp x0,x1
b.lt .Lj693
mov x0,x1
.Lj693:
ret
.Lc617:
.Le178:
.size MATH_$$_MIN$INT64$INT64$$INT64, .Le178 - MATH_$$_MIN$INT64$INT64$$INT64
After:
.section .text.n_math_$$_min$int64$int64$$int64,"ax"
.balign 8
.globl MATH_$$_MIN$INT64$INT64$$INT64
.type MATH_$$_MIN$INT64$INT64$$INT64,@function
MATH_$$_MIN$INT64$INT64$$INT64:
.Lc618:
cmp x0,x1
// Peephole Optimization: CSEL Block (Simple type)
csel x0,x1,x0,ge
ret
.Lc617:
.Le178:
.size MATH_$$_MIN$INT64$INT64$$INT64, .Le178 - MATH_$$_MIN$INT64$INT64$$INT64
An example where none of the CSEL
's inputs equal the output is the IfThen
function in the Math unit, which simulates C's ternary operator with a Boolean input and LongInt output - before:
.section .text.n_math_$$_ifthen$boolean$longint$longint$$longint,"ax"
.balign 8
.globl MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT
.type MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT,@function
MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT:
.Lc704:
uxtb w3,w0
mov w0,w1
// Peephole Optimization: CMPB.E/NE2CBNZ/CBZ done
cbnz w3,.Lj821
mov w0,w2
.Lj821:
ret
.Lc703:
.Le212:
.size MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT, .Le212 - MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT
After:
.section .text.n_math_$$_ifthen$boolean$longint$longint$$longint,"ax"
.balign 8
.globl MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT
.type MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT,@function
MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT:
.Lc704:
uxtb w3,w0
// Peephole Optimization: MovCSel2CSel
cmp w3,#0
// Peephole Optimization: CSEL Block (Simple type)
csel w0,w2,w1,eq
ret
.Lc703:
.Le212:
.size MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT, .Le212 - MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT
In aasmcnst
, a "double branching" to the same destination label is demonstrated, which eliminates a label and some instructions - before:
...
.Lj361:
// Peephole Optimization: CMPB.E/NE2CBNZ/CBZ done
cbz w20,.Lj363
// Peephole Optimization: Movz0ToMovZeroReg
mov x0,xzr
b .Lj357
.Lj363:
mov x0,x1
b .Lj357
.Lj358:
// Peephole Optimization: Movz0ToMovZeroReg
mov x0,xzr
.Lj357:
...
After:
...
.Lj361:
cmp w20,#0
// Peephole Optimization: CSEL Block (Double branching (same) type)
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: MovCSel2CSel
csel x0,x1,xzr,eq
b .Lj357
.Lj358:
// Peephole Optimization: Movz0ToMovZeroReg
mov x0,xzr
.Lj357:
...
The classes
unit shows an example where more than one register is set this way - before:
...
.section .text.n_classes$_$tbits_$__$$_equals$tbits$$boolean,"ax"
cmp x2,x3
b.ge .Lj183
mov x19,x0
mov x20,x1
b .Lj184
.Lj183:
mov x19,x1
mov x20,x0
.Lj184:
...
After:
.section .text.n_classes$_$tbits_$__$$_equals$tbits$$boolean,"ax"
...
cmp x2,x3
// Peephole Optimization: CSEL Block (Double type)
// Peephole Optimization: MovCSel2CSel
// Peephole Optimization: MovCSel2CSel
csel x19,x1,x0,ge
csel x20,x0,x1,ge
...
Also in the classes
unit, an example exists where a register is reserved to store a constant - before:
.section .text.n_classes$_$tstream_$__$$_read64$tbytes$int64$int64$$int64,"ax"
...
.Lj222:
sub x22,x21,x23
// Peephole DataMov2Data removed superfluous mov
mov x0,#2147483647
cmp x22,x0
b.le .Lj226
mov x22,#2147483647
.Lj226:
...
After - this can be improved if it can be detected that x0
already contains the desired value:
.section .text.n_classes$_$tstream_$__$$_read64$tbytes$int64$int64$$int64,"ax"
...
mov x0,#2147483647
mov x1,#2147483647
cmp x22,x0
// Peephole Optimization: CSEL Block (Simple type)
csel x22,x1,x22,gt
...
Also in the classes
unit, a more long-range optimisation occurs via the new MovCSel2CSel
optimisation - before (unrelated debug messages removed for clarity):
.section .text.n_classes$_$tstream_$__$$_seek$longint$word$$longint,"ax"
...
mov x4,x19
ldr x5,[sp]
ldr x0,[x5, #280]
mov x6,sp
mov x5,x3
cmp x0,x5
b.ne .Lj264
mov x4,xzr
mov x3,xzr
.Lj264:
...
After - x19
doesn't change value between the MOV
and the conditional block, so it can be removed and the CSEL
instruction that was csel x4,xzr,x4,eq
changed to csel x4,xzr,x19,eq
:
.section .text.n_classes$_$tstream_$__$$_seek$longint$word$$longint,"ax"
...
// Peephole Optimization: MovCSel2CSel
ldr x5,[sp]
ldr x0,[x5, #280]
mov x6,sp
mov x5,x3
cmp x0,x5
// Peephole Optimization: CSEL Block (Simple type)
csel x4,xzr,x19,eq
csel x3,xzr,x3,eq
.Lj264:
...
Future development
- Branches that use
MOVZ/MOVN/MOVK
to set constants are currently not optimised. This will be subject to future development. - Currently,
MOV
instructions that are created to store simple constants are placed afterCMP
instructions and are not relocated to precede them (assuming theMOV
instruction doesn't write to the register being compared). In x86, this is achieved via theTrySwapMovCmp
optimisation routine, but this depends on x86-specific functionality, specificallyInsProp
inTrySwapMovOp
. An AArch64 version will be developed at a later date. - The
CSel2Mov
optimisation may be deprecated at a later date since it optimises situations where aCSEL
instruction is generated with the same true/false register, and is a sign of inefficient user code that can possibly be simplified at a higher level. This is demonstrated in theaasmcpu
unit - before:
...
.Lj511:
// Peephole Optimization: CMPB.E/NE2CBNZ/CBZ done
cbnz w1,.Lj516
// Peephole Optimization: Movz0ToMovZeroReg
mov w0,wzr
b .Lj508
.Lj516:
// Peephole Optimization: Movz0ToMovZeroReg
mov w0,wzr
b .Lj508
.Lj512:
mov w2,w1
cmp w2,#2
b.hs .Lj519
// Peephole Optimization: Movz0ToMovZeroReg
mov w0,wzr
b .Lj508
.Lj519:
// Peephole Optimization: Movz0ToMovZeroReg
mov w0,wzr
b .Lj508
.Lj513:
...
After:
...
.Lj511:
cmp w1,#0
// Peephole Optimization: CSEL Block (Double branching (same) type)
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: MovCSel2CSel
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: CSel2Mov (identical true/false registers)
mov w0,wzr
b .Lj508
.Lj512:
mov w2,w1
cmp w2,#2
// Peephole Optimization: CSEL Block (Double branching (same) type)
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: MovCSel2CSel
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: CSel2Mov (identical true/false registers)
mov w0,wzr
b .Lj508
.Lj513:
...