[ARM/AArch64] Single-instruction MOV sanitation
Summary
This merge request has two distinct optimisations that convert lone instructions into MOV equivalents:
- movz reg,#0, when by itself (not part of a movz/movk pair, for example), is changed into mov reg,xzr (or mov reg,wzr, depending on the register size).
- sbfx and ubfx instructions get converted to mov instructions if the lsb and width fields cover the entire register (performed in the pre-peephole stage).
(It also changes a debug message slightly so that it follows the standard.)
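To illustrate why a full-width bitfield extract is just a move, here is a minimal Python model of the two instructions. It is a sketch only: the function names and the reg_bits parameter are made up for this example and do not correspond to anything in the compiler source, and register values are modelled as unsigned integers.

```python
def ubfx(value, lsb, width, reg_bits=64):
    """Unsigned bitfield extract: take `width` bits from `lsb`, zero-extended."""
    assert 0 <= lsb < reg_bits and 1 <= width <= reg_bits - lsb
    return (value >> lsb) & ((1 << width) - 1)


def sbfx(value, lsb, width, reg_bits=64):
    """Signed bitfield extract: like ubfx, but sign-extended to reg_bits."""
    field = ubfx(value, lsb, width, reg_bits)
    sign = 1 << (width - 1)
    signed = (field ^ sign) - sign           # sign-extend the field
    return signed & ((1 << reg_bits) - 1)    # back to a register bit pattern


def covers_whole_register(lsb, width, reg_bits=64):
    """The condition under which the extract degenerates into a plain move."""
    return lsb == 0 and width == reg_bits


if __name__ == "__main__":
    for v in (0, 1, 0x7FFFFFFFFFFFFFFF, 0x8000000000000000, 0xFFFFFFFFFFFFFFFF):
        # With lsb = 0 and width = 64, both extracts return the input unchanged.
        assert ubfx(v, 0, 64) == v
        assert sbfx(v, 0, 64) == v
    print(hex(ubfx(0xABCD, 4, 8)))  # a genuine partial extract, for contrast
```

The same holds for 32-bit registers with lsb = 0 and width = 32 (pass reg_bits=32).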
System
- Operating system: Linux (Raspberry Pi OS) and others
- Processor architecture: ARM, AArch64
- Device: Raspberry Pi 400 (and others)
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
New optimisations should now sometimes be made as a result of an optimisation cascade.
Relevant logs and/or screenshots
Though the optimisations themselves don't reduce instruction size or increase speed, they do sometimes open up other optimisations. One striking example can be found in the TStream.Seek method of the Classes unit. Before:
...
.Lj308:
...
mov x4,x20
mov x0,sp
ldr x5,[sp]
ldr x5,[x5, #280]
mov x6,sp
ubfx x0,x5,#0,#64
ubfx x5,x3,#0,#64
cmp x0,x5
b.ne .Lj314
movz x5,#0
movz x0,#0
mov x3,x0
mov x4,x5
.Lj314:
ubfx x0,x3,#0,#64
...
After:
...
.Lj308:
...
mov x4,x20
ldr x5,[sp]
ldr x0,[x5, #280]
mov x6,sp
mov x5,x3
cmp x0,x5
b.ne .Lj314
mov x4,xzr
mov x3,xzr
.Lj314:
mov x0,x3
...
Converting the first two ubfx instructions into mov instructions opened up a simplification of ldr x5,[x5, #280]: it was changed into ldr x0,[x5, #280], and what had become mov x0,x5 was removed completely, since x5 gets overwritten on the very next instruction. (The cmp instruction could be changed to cmp x0,x3, which might allow x5 to be removed completely, but we'll deal with that another day.)
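The two rewrites in this merge request can be sketched as a toy pass over textual instructions. This is an illustration only: the real optimiser works on the compiler's internal instruction lists rather than strings, and the sanitise function and regular expressions below are hypothetical. The follow-on ldr/mov simplification is left to the existing peephole passes and is not modelled here.

```python
import re

# Hypothetical patterns for the textual forms used in the listings above.
BFX_RE = re.compile(r"^[us]bfx\s+([xw])(\d+),([xw])(\d+),#(\d+),#(\d+)$")
MOVZ0_RE = re.compile(r"^movz\s+([xw])(\d+),#0$")


def sanitise(instrs):
    """Rewrite full-width ubfx/sbfx and lone 'movz reg,#0' into mov forms."""
    out = []
    for i, ins in enumerate(instrs):
        m = BFX_RE.match(ins)
        if m:
            dk, dn, sk, sn, lsb, width = m.groups()
            size = 64 if dk == "x" else 32
            # Only a field covering the entire register is a plain move.
            if dk == sk and int(lsb) == 0 and int(width) == size:
                out.append(f"mov {dk}{dn},{sk}{sn}")
                continue
        m = MOVZ0_RE.match(ins)
        if m:
            kind, num = m.groups()
            nxt = instrs[i + 1] if i + 1 < len(instrs) else ""
            # Leave movz alone when it starts a movz/movk pair.
            if not nxt.startswith(f"movk {kind}{num},"):
                out.append(f"mov {kind}{num},{kind}zr")
                continue
        out.append(ins)
    return out


if __name__ == "__main__":
    before = ["ubfx x0,x5,#0,#64", "ubfx x5,x3,#0,#64", "cmp x0,x5",
              "movz x5,#0", "movz x0,#0"]
    for line in sanitise(before):
        print(line)
```

Checking only the immediately following instruction for a movk is a simplification made for this sketch.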
Meanwhile, changing movz reg,#0 to mov reg,xzr permitted an additional optimisation: x4 is set to x5 and x3 is set to x0, after which x0 and x5 are discarded, so the zero register can be moved into x4 and x3 directly. It's generally easier to track matching registers than immediate values. There is also future potential to optimise this further, from:
...
b.ne .Lj314
mov x4,xzr
mov x3,xzr
.Lj314:
...
To:
...
csel x4,xzr,x4,eq
csel x3,xzr,x3,eq
...
Looking at the code that appears earlier, x4 gets set to x20 (mov x4,x20), so this can be improved further:
...
csel x4,xzr,x20,eq
csel x3,xzr,x3,eq
...
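A quick way to sanity-check the condition code in a branch-to-csel rewrite is to model both forms and compare them for each value of the Z flag. The sketch below assumes the architectural definition of csel (the destination receives the first source if the condition holds, otherwise the second); all function names are invented for this example.

```python
def branch_over_movs(z_flag, x3, x4):
    """Model of: b.ne .Lj314 / mov x4,xzr / mov x3,xzr / .Lj314:"""
    if not z_flag:       # NE (Z clear): branch taken, the movs are skipped
        return x3, x4
    return 0, 0          # EQ (Z set): fall through, both registers zeroed


def csel(cond_holds, xn, xm):
    """csel xd,xn,xm,cond: xd = xn if cond holds, else xm."""
    return xn if cond_holds else xm


def cond(name, z_flag):
    """Evaluate the 'eq' or 'ne' condition against the Z flag."""
    return {"eq": z_flag, "ne": not z_flag}[name]


def csel_version(cc, z_flag, x3, x4):
    """Model of: csel x4,xzr,x4,cc / csel x3,xzr,x3,cc"""
    new_x4 = csel(cond(cc, z_flag), 0, x4)
    new_x3 = csel(cond(cc, z_flag), 0, x3)
    return new_x3, new_x4


def matching_conditions(x3=7, x4=9):
    """Return the condition codes for which both forms agree for every Z."""
    return [cc for cc in ("eq", "ne")
            if all(branch_over_movs(z, x3, x4) == csel_version(cc, z, x3, x4)
                   for z in (False, True))]


if __name__ == "__main__":
    print(matching_conditions())
```

Because the movs only execute on the fall-through path, the select has to fire on the inverse of the branch condition.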