[ARM / AArch64] Minimisation of zero and sign extension instructions (UXTB etc.)
Summary
This merge request introduces a new peephole optimisation that attempts to remove unnecessary register extension operations (UXTB, UXTH, SXTB and SXTH) by analysing the instructions that follow to see if the extended bits have any bearing on the final result.
System
- Processor architecture: ARM, AArch64
- Device: Tested on Raspberry Pi 4 and Raspberry Pi 400
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
The number of unnecessary zero and sign extension operations is significantly reduced (at least under -O3).
Relevant logs and/or screenshots
Under ARMv7-A (arm-linux, -O4 -CpARMV7A -OpARMV7A, although these optimisations also take effect without specifying ARMv7-A), a large number of units receive improvements. To begin with a simple example, in the colortxt unit - before:
...
.globl COLORTXT$_$TCOLOREDTEXT_$__$$_GETTHECOLOR$$BYTE
...
ldrb r4,[r0, #64]
b .Lj36
.Lj35:
mov r1,#1
bl VIEWS$_$TVIEW_$__$$_GETCOLOR$WORD$$WORD
uxth r0,r0
and r4,r0,#255
.Lj36:
and r0,r4,#255
After - due to the use of AND against the constant 255, everything above the least-significant 8 bits of r0 is masked out as it is written to r4, so the preceding UXTH is redundant:
...
.globl COLORTXT$_$TCOLOREDTEXT_$__$$_GETTHECOLOR$$BYTE
...
ldrb r4,[r0, #64]
b .Lj36
.Lj35:
mov r1,#1
bl VIEWS$_$TVIEW_$__$$_GETCOLOR$WORD$$WORD
and r4,r0,#255
.Lj36:
and r0,r4,#255
...
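The equivalence in the colortxt example can be checked with a small model. The following Python sketch is illustrative only and is not part of the patch; the helper names `uxth` and `and_imm` are hypothetical stand-ins that model the ARM instructions on 32-bit register values. It only models the value written to r4, since r0 is overwritten again at .Lj36.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values

def uxth(x):
    """Model UXTH: keep the low 16 bits, zero the upper 16."""
    return x & 0xFFFF

def and_imm(x, imm):
    """Model AND Rd, Rn, #imm."""
    return x & imm & MASK32

def r4_before(r0):
    # uxth r0,r0 ; and r4,r0,#255
    return and_imm(uxth(r0), 255)

def r4_after(r0):
    # and r4,r0,#255
    return and_imm(r0, 255)

# The mask 255 lies entirely inside the 16 bits that UXTH preserves,
# so zeroing bits 16..31 first cannot change the result.
samples = [0, 1, 0xFF, 0x100, 0xFFFF, 0x10000, 0x12345678, 0xFFFFFFFF]
assert all(r4_before(v) == r4_after(v) for v in samples)
```

The same argument applies to any AND mask that fits within the bits the extension keeps.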
In the aasmcpu unit - before:
...
.Lj1873:
...
strb r0,[r11, #-60]
uxtb r0,r0
sub r0,r0,#1
strb r0,[r11, #-60]
...
After - r0 is deallocated after the second STRB instruction, so its upper 24 bits have no bearing on the result, and they cannot affect the lower 8 bits of the subtraction:
...
.Lj1873:
...
strb r0,[r11, #-60]
sub r0,r0,#1
strb r0,[r11, #-60]
...
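The reason the subtraction is safe is that borrows only propagate from low bits to high bits. A minimal Python sketch of this reasoning follows; the helpers `uxtb`, `sub_imm` and `strb_byte` are hypothetical models of the instructions, not code from the compiler.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values

def uxtb(x):
    """Model UXTB: keep the low 8 bits, zero the upper 24."""
    return x & 0xFF

def sub_imm(x, imm):
    """Model SUB Rd, Rn, #imm with 32-bit wrap-around."""
    return (x - imm) & MASK32

def strb_byte(x):
    """The byte that STRB writes to memory: the low 8 bits of the register."""
    return x & 0xFF

def stored_before(r0):
    # uxtb r0,r0 ; sub r0,r0,#1 ; strb r0,[...]
    return strb_byte(sub_imm(uxtb(r0), 1))

def stored_after(r0):
    # sub r0,r0,#1 ; strb r0,[...]
    return strb_byte(sub_imm(r0, 1))

# Vary the upper 24 bits while sweeping every possible low byte.
for upper in (0x000000, 0x000001, 0x123456, 0xFFFFFF):
    for low in range(256):
        r0 = (upper << 8) | low
        assert stored_before(r0) == stored_after(r0)
```

The registers themselves differ after the subtraction (the upper 24 bits are not the same), which is why the optimisation depends on r0 being deallocated after the store.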
A similar thing happens in the aasmtai unit, albeit with an OR operation - before:
...
.Lj172:
...
strb r0,[r4, #38]
uxtb r0,r0
orr r0,r0,#4
strb r0,[r4, #38]
...
After (it's interesting to note that in both of these cases, the first STRB is a dead store; this might warrant further research):
...
.Lj172:
...
strb r0,[r4, #38]
orr r0,r0,#4
strb r0,[r4, #38]
...
In fpide, an example with ADD appears - before:
...
.section .text.n_fpide_$$_inctargetedeventptr$smallint$$smallint,"ax"
...
sxth r0,r0
add r0,r0,#1
sxth r0,r0
...
After - thanks to the second SXTH and the fact that the upper 16 bits have no bearing on the lower 16 bits during the addition, the first SXTH can be removed:
...
.section .text.n_fpide_$$_inctargetedeventptr$smallint$$smallint,"ax"
...
add r0,r0,#1
sxth r0,r0
...
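Here the full 32-bit register value is preserved, not just the low bits, because the final SXTH rebuilds the upper half from bit 15. The Python sketch below checks this over every 16-bit low half; `sxth` and `add_imm` are hypothetical helper names that model the instructions.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values

def sxth(x):
    """Model SXTH: sign-extend the low 16 bits into a 32-bit register value."""
    x &= 0xFFFF
    return (x | 0xFFFF0000) if (x & 0x8000) else x

def add_imm(x, imm):
    """Model ADD Rd, Rn, #imm with 32-bit wrap-around."""
    return (x + imm) & MASK32

def r0_before(r0):
    # sxth r0,r0 ; add r0,r0,#1 ; sxth r0,r0
    return sxth(add_imm(sxth(r0), 1))

def r0_after(r0):
    # add r0,r0,#1 ; sxth r0,r0
    return sxth(add_imm(r0, 1))

# Carries travel upwards only, so the low 16 bits of the sum depend solely on
# the low 16 bits of the input; the trailing SXTH then fixes the upper half.
for upper in (0x0000, 0x0001, 0xFFFF):
    for low in range(0x10000):
        r0 = (upper << 16) | low
        assert r0_before(r0) == r0_after(r0)
```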
In fpwritepng, an example with UXTH and AND against 65280 appears (a large immediate that, for once, can be encoded directly as an 8-bit value rotated by the ARM CPU's barrel shifter) - before:
...
.section .text.n_fpwritepng$_$tfpwriterpng_$__$$_colordatacolorb$tfpcolor$$qword,"ax"
...
mov r0,r1,lsr #16
uxth r0,r0
and r0,r0,#65280
add r0,r0,r3
...
After:
...
.section .text.n_fpwritepng$_$tfpwriterpng_$__$$_colordatacolorb$tfpcolor$$qword,"ax"
...
mov r0,r1,lsr #16
and r0,r0,#65280
add r0,r0,r3
...
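Two facts underpin the fpwritepng example: 65280 (0xFF00) is encodable as an A32 data-processing immediate, and the mask sits entirely within the 16 bits that UXTH keeps. The Python sketch below checks both. The function `is_a32_immediate` is a hypothetical helper that models the classic ARM rule (an 8-bit value rotated right by an even amount); it is not taken from the compiler source.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values

def rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def is_a32_immediate(value):
    """True if value equals an 8-bit constant rotated right by an even amount."""
    value &= MASK32
    # Undoing a right-rotation by `rot` is a left-rotation by `rot`.
    return any(rol32(value, rot) < 256 for rot in range(0, 32, 2))

def uxth(x):
    """Model UXTH: keep the low 16 bits, zero the upper 16."""
    return x & 0xFFFF

def before(r0):
    # uxth r0,r0 ; and r0,r0,#65280
    return uxth(r0) & 65280

def after(r0):
    # and r0,r0,#65280
    return r0 & 65280

assert is_a32_immediate(65280)       # 0xFF rotated right by 24
assert not is_a32_immediate(0xFFFF)  # 16 set bits cannot fit in 8
samples = [0, 0xFF00, 0xFFFF, 0x12345678, 0xFFFFFFFF]
assert all(before(v) == after(v) for v in samples)
```

In this particular sequence the preceding logical shift right by 16 already clears the upper half of r0, so the UXTH is redundant for that reason as well.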
In the pyacc unit, the removal of an SXTH instruction allows the two arithmetic operations on either side of it to be combined - before:
...
.Lj733:
add r0,r4,#1
sxth r0,r0
ldr r1,.Lj712
add r0,r0,#1
strh r0,[r1]
...
After (even if r4 contains information in the upper 16 bits, it has no bearing on the lower 16 bits of r0 that get stored via the STRH instruction):
...
.Lj733:
ldr r1,.Lj712
add r0,r4,#2
strh r0,[r1]
...
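The folding of the two additions can be checked in the same style. In the Python sketch below, `sxth`, `add_imm` and `strh_halfword` are hypothetical helpers that model the instructions; only the halfword written by STRH is compared, since that is the only observable result in this fragment.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values

def sxth(x):
    """Model SXTH: sign-extend the low 16 bits into a 32-bit register value."""
    x &= 0xFFFF
    return (x | 0xFFFF0000) if (x & 0x8000) else x

def add_imm(x, imm):
    """Model ADD Rd, Rn, #imm with 32-bit wrap-around."""
    return (x + imm) & MASK32

def strh_halfword(x):
    """The halfword that STRH writes to memory: the low 16 bits of the register."""
    return x & 0xFFFF

def stored_before(r4):
    # add r0,r4,#1 ; sxth r0,r0 ; add r0,r0,#1 ; strh r0,[r1]
    return strh_halfword(add_imm(sxth(add_imm(r4, 1)), 1))

def stored_after(r4):
    # add r0,r4,#2 ; strh r0,[r1]
    return strh_halfword(add_imm(r4, 2))

# Sweep every low halfword with a few different upper halves in r4.
for upper in (0x0000, 0x0001, 0xFFFF):
    for low in range(0x10000):
        r4 = (upper << 16) | low
        assert stored_before(r4) == stored_after(r4)
```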