[ARM] New Bcc; CMP; Bcc -> CMP(~c); Bcc optimisation
Summary
This merge request seeks out simple comparisons that are logically OR'd together and branch to the same destination, then removes the first jump and makes the second comparison conditional. For example:
cmp r0,#2
beq .Lbl
cmp r0,#10
beq .Lbl
Becomes:
cmp r0,#2
cmpne r0,#10
beq .Lbl
This reduces code size and the number of jumps. Averaged over the long term it does not impact performance, since it is generally about 50:50 whether the new sequence takes a cycle longer or a cycle fewer.
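The equivalence of the two sequences above can be checked with a small model. This is only an illustrative Python sketch, not part of the patch: it assumes that only the Z flag matters for these instructions, and the function names are hypothetical.

```python
def branch_taken_before(r0: int) -> bool:
    """Model of: cmp r0,#2; beq .Lbl; cmp r0,#10; beq .Lbl"""
    z = (r0 == 2)       # cmp r0,#2 sets Z when r0 == 2
    if z:               # beq .Lbl
        return True
    z = (r0 == 10)      # cmp r0,#10
    return z            # beq .Lbl


def branch_taken_after(r0: int) -> bool:
    """Model of: cmp r0,#2; cmpne r0,#10; beq .Lbl"""
    z = (r0 == 2)       # cmp r0,#2
    if not z:           # cmpne only executes while Z is clear
        z = (r0 == 10)
    return z            # beq .Lbl


# Both sequences agree on whether .Lbl is reached for a range of inputs.
assert all(branch_taken_before(v) == branch_taken_after(v)
           for v in range(-16, 32))
```

When r0 is 2, the second comparison is skipped and Z stays set, so the single remaining BEQ still jumps.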
System
- Operating system: Linux (Raspberry Pi OS) and others
- Processor architecture: ARM
- Device: Raspberry Pi 400 and others
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Many pairs of CMP; Bcc instructions can get optimised from 4 instructions down to 3.
Additional Notes
When optimising for speed (the default setting), only two such comparisons are chained together, as chaining any more than that will lead to performance losses. However, when optimising for size, each conditional chaining of a comparison removes a branch instruction, resulting in a smaller code size.
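As a rough illustration of the size saving, the instruction counts can be tallied as below. This is a sketch under the assumption that every comparison in the run branches to the same label and that all of them are chained (the size-optimising case); the function name is hypothetical.

```python
def instruction_counts(n_comparisons: int) -> tuple:
    """Return (before, after) instruction counts for a run of
    comparisons that all branch to the same label."""
    if n_comparisons < 1:
        raise ValueError("need at least one comparison")
    before = 2 * n_comparisons   # every comparison has its own Bcc
    after = n_comparisons + 1    # chained comparisons share one Bcc
    return before, after


# Two comparisons: 4 instructions down to 3, as in the summary example.
assert instruction_counts(2) == (4, 3)
```

Each additional chained comparison saves one more branch, which is why longer chains are only worthwhile when optimising for size.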
Relevant logs and/or screenshots
A large number of units benefit from this optimisation. In aasmcpu (-O4, arm-linux) - before:
...
.Lj985:
...
cmp r0,#2
beq .Lj994
cmp r0,#8
beq .Lj994
...
After:
...
.Lj985:
...
cmp r0,#2
cmpne r0,#8
beq .Lj994
...
In the fbadmin unit, a pair of TST instructions gets optimised instead - before:
.section .text.n_fbadmin$_$tfbadmin_$__$$_restore$ansistring$ansistring$tibrestoreoptions$ansistring$$boolean,"ax"
...
tst r7,#32
bne .Lj299
tst r7,#64
bne .Lj299
...
After (these TST instructions can probably be merged):
.section .text.n_fbadmin$_$tfbadmin_$__$$_restore$ansistring$ansistring$tibrestoreoptions$ansistring$$boolean,"ax"
...
tst r7,#32
tsteq r7,#64
bne .Lj299
...
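The TST case, and the possible merge mentioned above, can be modelled the same way. This Python sketch is illustrative only: it assumes only the Z flag is relevant, the function names are hypothetical, and the merged form (a single tst r7,#96) is a guess at what such a merge could look like rather than something this patch emits.

```python
def bne_taken_before(r7: int) -> bool:
    """Model of: tst r7,#32; bne .Lj299; tst r7,#64; bne .Lj299"""
    if r7 & 32:                 # Z clear after tst r7,#32, so bne jumps
        return True
    return (r7 & 64) != 0       # tst r7,#64; bne .Lj299


def bne_taken_after(r7: int) -> bool:
    """Model of: tst r7,#32; tsteq r7,#64; bne .Lj299"""
    z = (r7 & 32) == 0          # tst r7,#32
    if z:                       # tsteq only executes while Z is set
        z = (r7 & 64) == 0
    return not z                # bne .Lj299


def bne_taken_merged(r7: int) -> bool:
    """Hypothetical merged form: tst r7,#96; bne .Lj299"""
    return (r7 & 96) != 0       # 96 = 32 | 64


# All three forms agree for every combination of the low eight bits.
assert all(bne_taken_before(v) == bne_taken_after(v) == bne_taken_merged(v)
           for v in range(256))
```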
Later on in the same subroutine - before:
...
.Lj307:
tst r7,#256
bne .Lj309
tst r7,#512
beq .Lj311
.Lj309:
tst r7,#256
beq .Lj313
...
After - this gets combined with standard jump optimisations (originally beq .Lj311 was bne .Lj309; b .Lj311) to remove a label (the third TST can probably be optimised out too if one is careful with the order of optimisations):
...
.Lj307:
tst r7,#256
tsteq r7,#512
beq .Lj311
tst r7,#256
beq .Lj313
...
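For this mixed BNE/BEQ case, a model that returns the label reached makes the equivalence easier to see. As before, this is only a Python sketch assuming that the Z flag alone decides the branches; the function names and the "fallthrough" marker are hypothetical.

```python
def dest_before(r7: int) -> str:
    """Model of the original sequence starting at .Lj307."""
    if r7 & 256:                # tst r7,#256; bne .Lj309
        pass                    # jump straight to .Lj309
    elif (r7 & 512) == 0:       # tst r7,#512; beq .Lj311
        return ".Lj311"
    # .Lj309:
    if (r7 & 256) == 0:         # tst r7,#256; beq .Lj313
        return ".Lj313"
    return "fallthrough"


def dest_after(r7: int) -> str:
    """Model of the optimised sequence starting at .Lj307."""
    z = (r7 & 256) == 0         # tst r7,#256
    if z:                       # tsteq only executes while Z is set
        z = (r7 & 512) == 0
    if z:                       # beq .Lj311
        return ".Lj311"
    if (r7 & 256) == 0:         # tst r7,#256; beq .Lj313
        return ".Lj313"
    return "fallthrough"


# The four combinations of bits 256 and 512 reach the same destination.
assert all(dest_before(v) == dest_after(v) for v in (0, 256, 512, 768))
```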
In the system unit, a similar simplification is done with regular CMP instructions - before:
...
.Lj4731:
...
cmp r0,#9
beq .Lj4737
cmp r0,#32
bne .Lj4733
.Lj4737:
ldr r0,[r11]
add r0,r0,#1
...
After:
...
.Lj4731:
...
cmp r0,#9
cmpne r0,#32
bne .Lj4733
ldr r0,[r11]
add r0,r0,#1
...
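In this last example the final branch is a BNE that skips the code at the removed label, so the condition is effectively inverted. The Python sketch below models whether the jump to .Lj4733 is taken; it is illustrative only, assumes that only the Z flag matters, and uses hypothetical function names.

```python
def skip_taken_before(r0: int) -> bool:
    """Model of: cmp r0,#9; beq .Lj4737; cmp r0,#32; bne .Lj4733"""
    if r0 == 9:                 # beq .Lj4737 enters the block at the label
        return False
    return r0 != 32             # cmp r0,#32; bne .Lj4733


def skip_taken_after(r0: int) -> bool:
    """Model of: cmp r0,#9; cmpne r0,#32; bne .Lj4733"""
    z = (r0 == 9)               # cmp r0,#9
    if not z:                   # cmpne only executes while Z is clear
        z = (r0 == 32)
    return not z                # bne .Lj4733


# .Lj4733 is reached only when r0 is neither 9 nor 32 in both versions.
assert all(skip_taken_before(v) == skip_taken_after(v) for v in range(128))
```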