[AArch64] MOV(Z) reg1,const / op reg2,reg1 Deep Optimisation
Summary
This merge request marks the beginning of some deep peephole optimisations that seek to optimise code where a MOV reg,const (or MOVZ) instruction exists and its destination register is then used in an arithmetic operation, by checking whether the constant can be encoded directly in that instruction. A new helper function, named is_arith_const, evaluates whether the constant can be encoded in an ADD, SUB, CMN or CMP instruction (the numbers 0 to 4095, and 4096 to 16773120 in steps of 4096).
Other instructions, such as logical instructions, will be added at a later date, as well as optimisations that deal with MOV reg1,reg2 instructions.
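To make the encodable range concrete, here is a minimal sketch of the check in Python. It is not the compiler's actual is_arith_const; it assumes only the range quoted above (an unsigned 12-bit immediate, optionally shifted left by 12 bits) and ignores anything else the real helper may handle, such as negative values.

```python
def is_arith_const(value: int) -> bool:
    """Illustrative model: can value be an ADD/SUB/CMN/CMP immediate?

    Assumption: only the range stated in the summary is modelled,
    i.e. imm12 or imm12 shifted left by 12 bits.
    """
    if value < 0:
        return False
    # Unshifted 12-bit immediate: 0 to 4095.
    if value <= 0xFFF:
        return True
    # Shifted form: low 12 bits must be clear and the upper part
    # must itself fit in 12 bits (4096 to 16773120, steps of 4096).
    return (value & 0xFFF) == 0 and (value >> 12) <= 0xFFF
```

Under this model 32 and 8192 are encodable, while 4097 is not, because it needs bits from both the shifted and the unshifted field.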
Additionally, a node-level optimisation has been implemented for min/max nodes that makes use of is_arith_const and attempts to directly create CMP reg,const instructions if the second operand is a constant, thus taking some strain off the peephole optimizer at low optimisation settings and potentially reducing the number of required iterations at higher optimisation settings.
System
- Operating system: Linux (Raspberry Pi OS) and others
- Processor architecture: AArch64
- Device: Raspberry Pi and others
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Code is made smaller and faster in situations where constants are moved into registers.
Relevant logs and/or screenshots
For a simple one to start off with: in the cgbase unit (aarch64-linux, -O4) - before:
.section .text.n_cgbase_$$_initmms$pmmshuffle$shortint,"ax"
...
.Lc96:
stp x19,x20,[sp, #-16]!
mov x19,x0
sxtb w20,w1
mov w0,w20
mov w1,wzr
cmp w0,w1
csel w0,w0,w1,gt
...
After - the CMP instruction gets changed from cmp w0,w1 to cmp w0,#0 since w1 is equal to zero (changed to wzr by the peephole optimizer elsewhere):
.section .text.n_cgbase_$$_initmms$pmmshuffle$shortint,"ax"
...
.Lc96:
stp x19,x20,[sp, #-16]!
mov x19,x0
sxtb w20,w1
mov w0,w20
mov w1,wzr
cmp w0,#0
csel w0,w0,w1,gt
...
Future optimisations will deal with the fact that w0 = w20 and w1 = wzr throughout much of the block.
In the regexpr unit - before:
.section .text.n_regexpr_$$__uppercase$ansichar$$ansichar,"ax"
...
movz w0,#32
sub w20,w20,w0
uxtb w20,w20
b .Lj64
.Lj66:
...
After - the number 32 can be encoded in SUB instructions, and since w0 is deallocated after this instruction, the original MOVZ gets removed as well:
.section .text.n_regexpr_$$__uppercase$ansichar$$ansichar,"ax"
...
sub w20,w20,#32
uxtb w20,w20
b .Lj64
.Lj66:
...
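The MOVZ/SUB rewrite above can be modelled in a few lines of Python. This is only a sketch of the idea, not the compiler's code: instructions are plain tuples, fold_mov_into_arith and its reg_dead_after flag are hypothetical names, and is_arith_const is a stand-in that models only the range quoted in the summary.

```python
def is_arith_const(value):
    # Stand-in check: imm12, or imm12 shifted left by 12 bits.
    return 0 <= value <= 0xFFF or ((value & 0xFFF) == 0 and 0 < value <= 0xFFF000)


def fold_mov_into_arith(mov, arith, reg_dead_after):
    """Fold the constant of a MOV/MOVZ into the following ADD/SUB.

    mov   = ("movz", reg, const)
    arith = (op, dst, lhs, rhs)
    reg_dead_after: True if the MOV's register is not needed afterwards
    (hypothetical flag; a real compiler gets this from liveness data).
    Returns the instructions that replace the pair.
    """
    _, mov_reg, const = mov
    op, dst, lhs, rhs = arith
    foldable = (
        op in ("add", "sub")
        and rhs == mov_reg
        and lhs != mov_reg          # stay conservative if also the left operand
        and is_arith_const(const)
    )
    if not foldable:
        return [mov, arith]
    folded = (op, dst, lhs, f"#{const}")
    # The MOV can only go if nothing else still reads its register.
    return [folded] if reg_dead_after else [mov, folded]
```

Applied to the regexpr pair, `fold_mov_into_arith(("movz", "w0", 32), ("sub", "w20", "w20", "w0"), True)` yields the single instruction `("sub", "w20", "w20", "#32")`; with `reg_dead_after=False` the MOVZ is kept in front of the folded SUB.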
In the uhpackimp unit - multiple supporting peephole optimizations come into play - before:
.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
...
.Lj606:
str wzr,[x19, #36]
mov w1,wzr
add w0,w0,w1
str w0,[x19, #32]
...
Intermediate - since wzr is treated like 0 in the new optimisation, the value replaces w1 in the ADD instruction and then the original MOV gets removed since w1 is deallocated:
.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
...
.Lj606:
str wzr,[x19, #36]
add w0,w0,wzr
str w0,[x19, #32]
...
After - an additional new peephole optimisation removes add reg,reg,#0/wzr and sub reg,reg,#0/wzr instructions since they are identity operations, and then the two STR instructions are merged into an STP instruction since they write to contiguous memory:
.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
...
.Lj606:
stp w0,wzr,[x19, #32]
...
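The last two steps (dropping identity ADD/SUB instructions, then pairing adjacent stores) can be sketched as follows. This is an illustrative Python model rather than the compiler's implementation: the tuple layout and the function names are assumptions, only 32-bit registers are handled, and the offset limits are those of the 32-bit STP signed-offset form (multiples of 4 from -256 to 252).

```python
def is_identity_arith(instr):
    """True for add/sub reg,reg,#0 or reg,reg,wzr/xzr (value unchanged)."""
    return (len(instr) == 4 and instr[0] in ("add", "sub")
            and instr[1] == instr[2] and instr[3] in ("#0", "wzr", "xzr"))


def try_merge_stores(first, second):
    """Merge two adjacent 32-bit STRs to neighbouring words into one STP.

    Stores are ("str", reg, base, offset). Returns the STP tuple or None.
    """
    if first[0] != "str" or second[0] != "str":
        return None
    _, reg_a, base_a, off_a = first
    _, reg_b, base_b, off_b = second
    if base_a != base_b or not (reg_a.startswith("w") and reg_b.startswith("w")):
        return None
    if abs(off_a - off_b) != 4:
        return None
    # STP stores its first register at the lower address.
    (lo_off, lo_reg), (_, hi_reg) = sorted([(off_a, reg_a), (off_b, reg_b)])
    if lo_off % 4 != 0 or not -256 <= lo_off <= 252:
        return None
    return ("stp", lo_reg, hi_reg, base_a, lo_off)


def peephole(instrs):
    """Drop identity arithmetic, then pair up stores that became adjacent."""
    out = []
    for instr in (i for i in instrs if not is_identity_arith(i)):
        merged = try_merge_stores(out[-1], instr) if out else None
        if merged:
            out[-1] = merged
        else:
            out.append(instr)
    return out
```

Feeding in the intermediate uhpackimp sequence, `peephole([("str", "wzr", "x19", 36), ("add", "w0", "w0", "wzr"), ("str", "w0", "x19", 32)])` gives `[("stp", "w0", "wzr", "x19", 32)]`, which corresponds to the stp w0,wzr,[x19, #32] in the final listing.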