[ARM] New Bcc; CMP; Bcc -> CMP(~c); Bcc optimisation (!644)

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:arm-conditional-ops into main Apr 06, 2024

Summary

This merge request seeks out simple comparisons that are logically OR'd together and branch to the same destination, then removes the first jump and makes the second comparison conditional. For example:

	cmp	r0,#2
	beq	.Lbl
	cmp	r0,#10
	beq	.Lbl

Becomes:

	cmp	r0,#2
	cmpne	r0,#10
	beq	.Lbl

This reduces both code size and the number of jumps. As a long-term average it does not impact performance: it is generally about 50:50 whether a given execution takes a cycle longer or a cycle fewer.
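The equivalence of the two sequences can be checked with a small model. The following is only a sketch in Python, not compiler code: it tracks nothing but the Z flag, and the function names `run_before` and `run_after` are made up for illustration.

```python
# Toy model: only the Z flag and the instruction forms shown above.

def run_before(r0, a=2, b=10):
    """cmp r0,#a; beq .Lbl; cmp r0,#b; beq .Lbl -> True if .Lbl is reached."""
    z = (r0 == a)      # cmp r0,#a sets Z when the operands are equal
    if z:              # beq .Lbl
        return True
    z = (r0 == b)      # cmp r0,#b
    return z           # beq .Lbl

def run_after(r0, a=2, b=10):
    """cmp r0,#a; cmpne r0,#b; beq .Lbl -> True if .Lbl is reached."""
    z = (r0 == a)      # cmp r0,#a
    if not z:          # cmpne only executes when Z is clear
        z = (r0 == b)  # ...and then overwrites Z
    return z           # beq .Lbl

# Collect every input for which the two sequences disagree.
mismatches = [r0 for r0 in range(-16, 48) if run_before(r0) != run_after(r0)]
print(mismatches)
```

The key point is that CMPNE leaves Z set when the first comparison already matched, so the single BEQ sees the OR of both tests.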

System

Operating system: Linux (Raspberry Pi OS) and others
Processor architecture: ARM
Device: Raspberry Pi 400 and others

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Many back-to-back pairs of CMP; Bcc instructions get optimised from 4 instructions down to 3.

Additional Notes

When optimising for speed (the default setting), only two such comparisons are chained together, as chaining any more than that will lead to performance losses. When optimising for size, however, each additional chained comparison removes a branch instruction, resulting in smaller code.
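The size and speed trade-off can be illustrated with a rough count. This sketch assumes 4-byte ARM-mode encodings and counts one issue slot per instruction, including conditional comparisons that are skipped; it does not model branch prediction or pipeline effects, so it is not a cycle-accurate account of the behaviour described above. All function names are hypothetical.

```python
def issued_separate(n, first_hit):
    """Instructions issued by n separate 'cmp; beq' pairs.

    first_hit is the 1-based index of the first comparison that matches,
    or None when none match and every pair is executed.
    """
    pairs_run = n if first_hit is None else first_hit
    return 2 * pairs_run

def issued_chained(n):
    """One cmp, n-1 conditional compares and a single beq; all are issued."""
    return n + 1

def bytes_saved(n):
    """Static saving, assuming 4 bytes per instruction."""
    return 4 * (2 * n - (n + 1))

for n in (2, 3, 4):
    hits = list(range(1, n + 1)) + [None]
    deltas = [issued_chained(n) - issued_separate(n, h) for h in hits]
    print(n, bytes_saved(n), deltas)
```

Under these assumptions, a chain of two costs one extra slot when the first comparison matches and saves one otherwise, while longer chains save more bytes but pay a growing penalty whenever an early comparison matches.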

Relevant logs and/or screenshots

A large number of units benefit from this optimisation. In aasmcpu (-O4, arm-linux) - before:

	...
.Lj985:
	...
	cmp	r0,#2
	beq	.Lj994
	cmp	r0,#8
	beq	.Lj994
	...

After:

	...
.Lj985:
	...
	cmp	r0,#2
	cmpne	r0,#8
	beq	.Lj994
	...

In the fbadmin unit, a pair of TST instructions gets optimised instead - before:

.section .text.n_fbadmin$_$tfbadmin_$__$$_restore$ansistring$ansistring$tibrestoreoptions$ansistring$$boolean,"ax"
	...
	tst	r7,#32
	bne	.Lj299
	tst	r7,#64
	bne	.Lj299
	...

After (these TST instructions can probably be merged):

.section .text.n_fbadmin$_$tfbadmin_$__$$_restore$ansistring$ansistring$tibrestoreoptions$ansistring$$boolean,"ax"
	...
	tst	r7,#32
	tsteq	r7,#64
	bne	.Lj299
	...
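As a sanity check on the remark that the two TST instructions could be merged, the sketch below compares the chained form against a single test of both bits. The merged form (`tst r7,#96; bne .Lj299`) is hypothetical and is not something this patch emits; the function names are illustrative only.

```python
def taken_chained(r7):
    """tst r7,#32; tsteq r7,#64; bne .Lj299 -> True if the branch is taken."""
    z = (r7 & 32) == 0      # tst sets Z when the AND result is zero
    if z:                   # tsteq only executes when Z is set
        z = (r7 & 64) == 0
    return not z            # bne is taken when Z is clear

def taken_merged(r7):
    """Hypothetical merged form: tst r7,#96; bne .Lj299."""
    return (r7 & (32 | 64)) != 0

# Bits above 7 cannot affect either form, so 0..255 covers every case.
all_equal = all(taken_chained(v) == taken_merged(v) for v in range(256))
print(all_equal)
```

Both forms branch exactly when at least one of bits 5 and 6 is set, and the combined mask 96 fits in a single ARM immediate.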

Later on in the same subroutine - before:

	...
.Lj307:
	tst	r7,#256
	bne	.Lj309
	tst	r7,#512
	beq	.Lj311
.Lj309:
	tst	r7,#256
	beq	.Lj313
	...

After - this gets combined with standard jump optimisations to remove a label (originally beq .Lj311 was bne .Lj309; b .Lj311). The third TST can probably be optimised out too if one is careful with the order of optimisations:

	...
.Lj307:
	tst	r7,#256
	tsteq	r7,#512
	beq	.Lj311
	tst	r7,#256
	beq	.Lj313
	...
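The four combinations of bits 8 and 9 are enough to compare the two listings. The sketch below models only the Z flag and returns the label each version ends up at ("fallthrough" stands for the code after the final BEQ); it is an illustration, not compiler output.

```python
def before(r7):
    if r7 & 256:                 # tst r7,#256; bne .Lj309
        pass
    elif (r7 & 512) == 0:        # tst r7,#512; beq .Lj311
        return ".Lj311"
    # .Lj309:
    z = (r7 & 256) == 0          # tst r7,#256
    return ".Lj313" if z else "fallthrough"   # beq .Lj313

def after(r7):
    z = (r7 & 256) == 0          # tst r7,#256
    if z:                        # tsteq only executes when Z is set
        z = (r7 & 512) == 0
    if z:                        # beq .Lj311
        return ".Lj311"
    z = (r7 & 256) == 0          # third tst recomputes Z
    return ".Lj313" if z else "fallthrough"   # beq .Lj313

cases = (0, 256, 512, 768)
table = [(v, before(v), after(v)) for v in cases]
print(table)
```

Note that for r7 = 512 the chain leaves Z clear while the third TST sets it, so in this model the third TST cannot simply be deleted; removing it would need the surrounding branches to be reordered as well.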

In the system unit, a similar simplification is done with regular CMP instructions - before:

	...
.Lj4731:
	...
	cmp	r0,#9
	beq	.Lj4737
	cmp	r0,#32
	bne	.Lj4733
.Lj4737:
	ldr	r0,[r11]
	add	r0,r0,#1
	...