[ARM / AArch64] Extended the AND; CMP -> ANDS family of optimisations to also optimise other bitwise operations
Summary
This merge request extends the ARM peephole optimisations that convert `AND`/`CMP` pairs into `ANDS` or `TST` instructions. They now also check for `BIC` (AND NOT) in place of `AND`, since under AArch64, `BIC` instructions can set the flags much like `AND` can.
Under 32-bit ARM, many more instructions can set the flags, so `ORR`, `EOR` and `ORN` operations are also optimised this way.
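The flag equivalence these peepholes rely on can be sketched in Python (a hypothetical model of the AArch64 flag semantics, not FPC's actual Pascal implementation): `BIC` followed by `CMP #0` leaves the same N and Z flags as a single `BICS`, which is all that a following `b.eq`/`b.ne` reads.

```python
# Sketch (not FPC's code): why "bic x,a,b ; cmp x,#0" can become "bics x,a,b".
# Both sequences leave identical N and Z flags.  Note the fold is only valid
# when the later branch reads N/Z (eq/ne/mi/pl): CMP also sets C=1/V=0,
# whereas flag-setting logical ops such as BICS clear C and V.
MASK64 = (1 << 64) - 1

def nz(value):
    """N and Z flags as set for a 64-bit result."""
    v = value & MASK64
    return ((v >> 63) & 1, int(v == 0))

def bic_cmp_zero(a, b):
    x = a & ~b & MASK64          # bic x,a,b
    return x, nz(x)              # cmp x,#0 derives N/Z from x

def bics(a, b):
    x = a & ~b & MASK64          # bics x,a,b sets N/Z itself
    return x, nz(x)
```

Checking a few values (zero result, sign bit set, all-ones) confirms the two sequences agree on N and Z.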
System
- Operating system: Linux (Raspberry Pi OS) and others
- Processor architecture: arm, AArch64
- Device: Raspberry Pi 400
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
`BIC`/`CMP` pairs are now optimised into `BICS` instructions.
Relevant logs and/or screenshots
Actual size savings are minimal at best because, while `CMP` instructions get removed, they are frequently `cmp r1,#0` and the like, which tend to get merged with the succeeding conditional jump to produce `cbz` or `cbnz` instructions. At the very least, though, code should not be slower, since the optimisation removes any kind of read-after-write penalty on the register (although the flags have to be read instead), and there may be real-world examples where the comparison is not against zero.
Under AArch64, two examples exist within the compiler and RTL:
In the `system` unit (-O4, aarch64-linux) - before:
```
.section .text.n_fpc_varset_contains_sets,"ax"
...
.Lj1165:
ldr x6,[x5]
ldr x3,[x4]
bic x3,x3,x6
cbnz x3,.Lj1160
...
```
After:
```
.section .text.n_fpc_varset_contains_sets,"ax"
...
.Lj1165:
ldr x6,[x5]
ldr x3,[x4]
bics xzr,x3,x6 // <-- writing to xzr since x3 is not used afterwards, simulating a "tst" pseudo-instruction ("ands xzr,...")
b.ne .Lj1160
...
```
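The `xzr` destination trick above can be illustrated with a small Python sketch (a hypothetical model of the flag behaviour, not FPC's actual code): writes to `xzr` are discarded, so `bics xzr,x3,x6` keeps only the flags, and `b.ne` then branches exactly when the original `cbnz x3` would have.

```python
# Sketch: "bics xzr,a,b" acts as a flag-only test, like "tst" does for "ands".
# The result register is the zero register, so the value is thrown away and
# only N/Z survive -- which is all the original "cbnz" sequence tested.
MASK64 = (1 << 64) - 1

def bics_to_xzr(a, b):
    """Flags (N, Z) left by bics xzr,a,b; the result itself is discarded."""
    v = a & ~b & MASK64
    return ((v >> 63) & 1, int(v == 0))

def cbnz_taken(a, b):
    # Original: bic x3,a,b ; cbnz x3,label -> branch if the result is nonzero.
    return (a & ~b & MASK64) != 0

def bne_taken(a, b):
    # Rewritten: bics xzr,a,b ; b.ne label -> branch if Z is clear.
    _, z = bics_to_xzr(a, b)
    return z == 0
```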
In the `cgcpu` unit - before:
```
.section .text.n_cgcpu$_$tcgaarch64_$__$$_a_op_const_reg_reg_checkoverflow$h$vgjqpygugh,"ax"
...
.Lj570:
sxtw x0,w22
ldr x1,[sp, #24]
bic x0,x1,x0
cbz x0,.Lj542
...
```
After:
```
.section .text.n_cgcpu$_$tcgaarch64_$__$$_a_op_const_reg_reg_checkoverflow$h$vgjqpygugh,"ax"
...
.Lj570:
sxtw x0,w22
ldr x1,[sp, #24]
bics xzr,x1,x0
b.eq .Lj542
...
```
Under 32-bit ARM, the compiler does not currently generate `BIC` instructions, but many optimisation opportunities with `ORR` instructions appear. For example, in the `fmtbcd` unit (-O4, arm-linux) - before:
```
.section .text.n_fmtbcd_$$_varianttobcd$tvardata$$tbcd,"ax"
...
.Lj1744:
ldrb r1,[r5]
ldrb r0,[r5, #1]
orr r1,r1,r0,lsl #8
cmp r1,#0
beq .Lj1748
...
```
After:
```
.section .text.n_fmtbcd_$$_varianttobcd$tvardata$$tbcd,"ax"
...
.Lj1744:
ldrb r1,[r5]
ldrb r0,[r5, #1]
orrs r1,r1,r0,lsl #8
beq .Lj1748
...
```
The `ptc` unit has a longer-range example - before:
```
.section .text.n_ptc_$$_freememandnil$formal,"ax"
...
orr r0,r0,r3,lsl #24
mov r2,#0
strb r2,[r1]
mov r2,r2,lsr #8
strb r2,[r1, #1]
mov r2,r2,lsr #8
strb r2,[r1, #2]
mov r2,r2,lsr #8
strb r2,[r1, #3]
cmp r0,#0
beq .Lj8
...
```
After - since the flags aren't modified between the `ORR` instruction and the `CMP` instruction, the latter can be removed and the former changed to `ORRS`, even though there is a fair bit of data manipulation in between (an inefficient way of writing an unaligned 32-bit zero to wherever `r1` is pointing):
```
.section .text.n_ptc_$$_freememandnil$formal,"ax"
...
orrs r0,r0,r3,lsl #24
mov r2,#0
strb r2,[r1]
mov r2,r2,lsr #8
strb r2,[r1, #1]
mov r2,r2,lsr #8
strb r2,[r1, #2]
mov r2,r2,lsr #8
strb r2,[r1, #3]
beq .Lj8
...
```
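The long-range fold above is only legal because nothing between the `ORR` and the `CMP` touches the flags or redefines `r0`. A simplified Python sketch of that safety scan (hypothetical instruction model, not FPC's actual peephole implementation):

```python
# Simplified sketch of the flag-safety scan: the CMP can be folded back into
# the ORR only if no instruction in between writes the flags or redefines the
# register being compared.  (A real pass must also stop at instructions that
# READ the flags, e.g. conditional ops, since ORRS changes what they see.)
# The (mnemonic, dest_reg) tuples below are a toy model, not FPC's tai chain.

FLAG_WRITERS = {"cmp", "cmn", "tst", "teq",
                "adds", "subs", "ands", "orrs", "eors", "bics"}

def can_fold_cmp(instrs, orr_idx, cmp_idx, test_reg):
    """True if the cmp at cmp_idx can be folded into the orr at orr_idx."""
    for mnemonic, dest in instrs[orr_idx + 1:cmp_idx]:
        if mnemonic in FLAG_WRITERS:
            return False   # flags clobbered before the cmp is reached
        if dest == test_reg:
            return False   # compared register redefined in between
    return True

# The ptc example: the mov/strb sequence (no S suffix) never sets the flags
# and only writes r2, so the fold is safe.
seq = [("orr", "r0"), ("mov", "r2"), ("strb", "r2"), ("mov", "r2"),
       ("strb", "r2"), ("mov", "r2"), ("strb", "r2"), ("mov", "r2"),
       ("strb", "r2"), ("cmp", "r0")]
```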