[ARM / AArch64] Extended the AND; CMP -> ANDS family of optimisations to also optimise other bitwise operations
Summary
This merge request extends the ARM peephole optimisations that convert `AND`/`CMP` pairs into `ANDS` or `TST` instructions. They now also check for `BIC` (AND NOT) in place of `AND`, since under AArch64, `BIC` instructions can set the flags much like `AND` can.
Under 32-bit ARM, many more instructions can set the flags, so `ORR`, `EOR` and `ORN` operations are also optimised this way.
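The flag equivalence these peepholes rely on can be sketched in Python (a hypothetical model of the AArch64 flag semantics, not FPC's actual Pascal implementation): `BIC` followed by `CMP #0` leaves the same N and Z flags as a single `BICS`, which is all that a following `b.eq`/`b.ne` reads.

```python
# Sketch (not FPC's code): why "bic x,a,b ; cmp x,#0" can become "bics x,a,b".
# Both sequences leave identical N and Z flags.  Note the fold is only valid
# when the later branch reads N/Z (eq/ne/mi/pl): CMP also sets C=1/V=0,
# whereas flag-setting logical ops such as BICS clear C and V.
MASK64 = (1 << 64) - 1

def nz(value):
    """N and Z flags as set for a 64-bit result."""
    v = value & MASK64
    return ((v >> 63) & 1, int(v == 0))

def bic_cmp_zero(a, b):
    x = a & ~b & MASK64          # bic x,a,b
    return x, nz(x)              # cmp x,#0 derives N/Z from x

def bics(a, b):
    x = a & ~b & MASK64          # bics x,a,b sets N/Z itself
    return x, nz(x)
```

Checking a few values (zero result, sign bit set, all-ones) confirms the two sequences agree on N and Z.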
System
- Operating system: Linux (Raspberry Pi OS) and others
- Processor architecture: arm, AArch64
- Device: Raspberry Pi 400
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
`BIC`/`CMP` pairs are now optimised into `BICS` instructions.
Relevant logs and/or screenshots
Actual size savings are minimal at best because, while `CMP` instructions get removed, they are frequently `cmp r1,#0` and the like, which tend to get merged with the succeeding conditional jump to produce `cbz` or `cbnz` instructions. At the very least, though, code should not be slower, since the optimisation removes any kind of read-after-write penalty on the register (although the flags have to be read instead), and there may be real-world examples where the comparison is not against zero.
Under AArch64, two examples exist within the compiler and RTL:
In the `system` unit (-O4, aarch64-linux) - before:
```
.section .text.n_fpc_varset_contains_sets,"ax"
...
.Lj1165:
ldr x6,[x5]
ldr x3,[x4]
bic x3,x3,x6
cbnz x3,.Lj1160
...
```
After:
```
.section .text.n_fpc_varset_contains_sets,"ax"
...
.Lj1165:
ldr x6,[x5]
ldr x3,[x4]
bics xzr,x3,x6 // <-- writing to xzr since x3 is not used afterwards, simulating a "tst" pseudo-instruction ("ands xzr,...")
b.ne .Lj1160
...
```
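The `xzr` destination trick above can be illustrated with a small Python sketch (a hypothetical model of the flag behaviour, not FPC's actual code): writes to `xzr` are discarded, so `bics xzr,x3,x6` keeps only the flags, and `b.ne` then branches exactly when the original `cbnz x3` would have.

```python
# Sketch: "bics xzr,a,b" acts as a flag-only test, like "tst" does for "ands".
# The result register is the zero register, so the value is thrown away and
# only N/Z survive -- which is all the original "cbnz" sequence tested.
MASK64 = (1 << 64) - 1

def bics_to_xzr(a, b):
    """Flags (N, Z) left by bics xzr,a,b; the result itself is discarded."""
    v = a & ~b & MASK64
    return ((v >> 63) & 1, int(v == 0))

def cbnz_taken(a, b):
    # Original: bic x3,a,b ; cbnz x3,label -> branch if the result is nonzero.
    return (a & ~b & MASK64) != 0

def bne_taken(a, b):
    # Rewritten: bics xzr,a,b ; b.ne label -> branch if Z is clear.
    _, z = bics_to_xzr(a, b)
    return z == 0
```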
In the `cgcpu` unit - before:
```
.section .text.n_cgcpu$_$tcgaarch64_$__$$_a_op_const_reg_reg_checkoverflow$h$vgjqpygugh,"ax"
...
.Lj570:
sxtw x0,w22
ldr x1,[sp, #24]
bic x0,x1,x0
cbz x0,.Lj542
...
```
After:
```
.section .text.n_cgcpu$_$tcgaarch64_$__$$_a_op_const_reg_reg_checkoverflow$h$vgjqpygugh,"ax"
...
.Lj570:
sxtw x0,w22
ldr x1,[sp, #24]
bics xzr,x1,x0
b.eq .Lj542
...
```
Under 32-bit ARM, the compiler does not currently generate `BIC` instructions, but many optimisation opportunities with `ORR` instructions appear. For example, in the `fmtbcd` unit (-O4, arm-linux) - before:
```
.section .text.n_fmtbcd_$$_varianttobcd$tvardata$$tbcd,"ax"
...
.Lj1744:
ldrb r1,[r5]
ldrb r0,[r5, #1]
orr r1,r1,r0,lsl #8
cmp r1,#0
beq .Lj1748
...
```
After:
```
.section .text.n_fmtbcd_$$_varianttobcd$tvardata$$tbcd,"ax"
...
.Lj1744:
ldrb r1,[r5]
ldrb r0,[r5, #1]
orrs r1,r1,r0,lsl #8
beq .Lj1748
...
```
The `ptc` unit has a longer-range example - before:
```
.section .text.n_ptc_$$_freememandnil$formal,"ax"
...
orr r0,r0,r3,lsl #24
mov r2,#0
strb r2,[r1]
mov r2,r2,lsr #8
strb r2,[r1, #1]
mov r2,r2,lsr #8
strb r2,[r1, #2]
mov r2,r2,lsr #8
strb r2,[r1, #3]
cmp r0,#0
beq .Lj8
...
```
After - since the flags aren't modified between the `ORR` instruction and the `CMP` instruction, the latter can be removed and the former changed to `ORRS`, even though there is a fair bit of data manipulation in between (an inefficient way of writing an unaligned 32-bit zero to wherever `r1` is pointing):
```
.section .text.n_ptc_$$_freememandnil$formal,"ax"
...
orrs r0,r0,r3,lsl #24
mov r2,#0
strb r2,[r1]
mov r2,r2,lsr #8
strb r2,[r1, #1]
mov r2,r2,lsr #8
strb r2,[r1, #2]
mov r2,r2,lsr #8
strb r2,[r1, #3]
beq .Lj8
...
```
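The long-range fold above is only legal because nothing between the `ORR` and the `CMP` touches the flags or redefines `r0`. A simplified Python sketch of that safety scan (hypothetical instruction model, not FPC's actual peephole implementation):

```python
# Simplified sketch of the flag-safety scan: the CMP can be folded back into
# the ORR only if no instruction in between writes the flags or redefines the
# register being compared.  (A real pass must also stop at instructions that
# READ the flags, e.g. conditional ops, since ORRS changes what they see.)
# The (mnemonic, dest_reg) tuples below are a toy model, not FPC's tai chain.

FLAG_WRITERS = {"cmp", "cmn", "tst", "teq",
                "adds", "subs", "ands", "orrs", "eors", "bics"}

def can_fold_cmp(instrs, orr_idx, cmp_idx, test_reg):
    """True if the cmp at cmp_idx can be folded into the orr at orr_idx."""
    for mnemonic, dest in instrs[orr_idx + 1:cmp_idx]:
        if mnemonic in FLAG_WRITERS:
            return False   # flags clobbered before the cmp is reached
        if dest == test_reg:
            return False   # compared register redefined in between
    return True

# The ptc example: the mov/strb sequence (no S suffix) never sets the flags
# and only writes r2, so the fold is safe.
seq = [("orr", "r0"), ("mov", "r2"), ("strb", "r2"), ("mov", "r2"),
       ("strb", "r2"), ("mov", "r2"), ("strb", "r2"), ("mov", "r2"),
       ("strb", "r2"), ("cmp", "r0")]
```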