[ARM / AArch64] AND/CMP -> TST optimisation and supporting code (!516)

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:and-cmp-ands-optimisation into main Oct 28, 2023

Summary

This merge request aims to improve the intermediate assembly language produced under Arm and AArch64 by converting AND/CMP pairs into TST instructions and eliminating the temporary register they used (the peephole optimizer attempts to remove the relevant allocation and deallocation entries). This reduces instruction count, increases speed and permits more future optimisations. It is split into the following parts:

- Some minor refactoring and utility functions:
  - A general-purpose TryRemoveRegAlloc routine that attempts to remove the allocation and deallocation entries for a particular register between two instructions.
  - FindRegAllocBackward now returns nil if the first tai_regalloc that it finds for the given register is a deallocation. Optimisations that use this function no longer need to check that the returned object is not a deallocation, which simplifies them and improves maintainability by eliminating bugs where this check might be forgotten.
  - The optimisation routines OptPostCMP and OptPostAnd for Arm and AArch64 have been renamed to PostPeepholeOptCMP and PostPeepholeOptAND for internal consistency.
- A new optimisation that converts AND/CMP into either TST or ANDS, depending on whether the temporary register is used afterwards. (AND/CMP to ANDS is already performed by the OpCmp2OpS optimisation, but since all of the checks have been performed by this point, it would waste time not to change it to ANDS here.)
- A new PostPeepholeOptTST routine that aims to convert TST/B.c pairs into TBZ or TBNZ instructions. This compensates for the fact that AND/CMP pairs have been simplified and will no longer be picked up by PostPeepholeOptAND.
- The PostPeepholeOptAND routine has been removed since it no longer catches the optimisation it was written for; PostPeepholeOptTST and the existing PostPeepholeOptCMP fulfil these roles (see the previous point).
- A new OptPass2TST routine that aims to catch a less common case where AND/CMP is converted into TST but an identical AND appears after the conditional jump. A saving can be made by converting the new TST instruction into ANDS, provided the register that the AND instruction writes to isn't in use.
- RegInInstruction, RegModifiedByInstruction and RegLoadedWithNewValue have been upgraded on ARM and AArch64 to better handle the flags register.
- OptPass2AND and OptPass2CMP will attempt to remove vestigial CMP and TST instructions (i.e. those that are no longer followed by conditional instructions).
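
To illustrate the FindRegAllocBackward change listed above, here is a toy model in Python. It is only a sketch: the real routine is Pascal code operating on the compiler's instruction list, and the `RegAlloc`, `Instr` and `find_reg_alloc_backward` names below are hypothetical stand-ins, not the compiler's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass(frozen=True)
class RegAlloc:
    """Toy stand-in for a tai_regalloc marker (hypothetical shape)."""
    reg: str
    is_alloc: bool  # True = allocation, False = deallocation


@dataclass(frozen=True)
class Instr:
    """Toy stand-in for an ordinary instruction entry."""
    text: str


Entry = Union[RegAlloc, Instr]


def find_reg_alloc_backward(entries: Sequence[Entry], reg: str,
                            start: int) -> Optional[RegAlloc]:
    """Walk backwards from entries[start] and look at the first marker for
    `reg`. Return it only if it is an allocation; a deallocation (or no
    marker at all) yields None, so callers need no extra check."""
    for i in range(start, -1, -1):
        entry = entries[i]
        if isinstance(entry, RegAlloc) and entry.reg == reg:
            return entry if entry.is_alloc else None
    return None
```

With this behaviour a caller can treat any non-None result as a live allocation, which is the simplification the bullet describes.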

Additionally, due to a particular occurrence in the Classes unit, the new AND; CMP -> TST optimisation will look out for AND; CMP; B.c; AND combinations, where the second AND has the same inputs as the first AND. The destination register can be the same or different: if it is the same, the second AND is removed completely; if it is different, it is changed to MOV. In both cases, the first AND is changed to ANDS.
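
The core AND/CMP rewrite from the summary can be modelled roughly as follows. This is a minimal Python sketch over tuples, not the Pascal implementation in the patch: `fold_and_cmp` and the `dest_used_later` flag are hypothetical, and the sketch assumes the caller already knows whether the temporary register is still needed. It also does not model which condition codes remain valid after the rewrite.

```python
from typing import List, Tuple

Op = Tuple[str, ...]  # (mnemonic, operand, ...), a toy instruction format


def fold_and_cmp(code: List[Op], dest_used_later: bool) -> List[Op]:
    """Rewrite the first `and rd,rn,imm; cmp rd,#0` pair found in `code`.

    dest_used_later is supplied by the caller in this sketch; a real
    optimiser would derive it from register allocation information.
    """
    out = list(code)
    for i in range(len(out) - 1):
        first, second = out[i], out[i + 1]
        if (len(first) == 4 and first[0] == "and"
                and second == ("cmp", first[1], "#0")):
            if dest_used_later:
                # Result still needed: keep the write, let it set the flags.
                out[i:i + 2] = [("ands", first[1], first[2], first[3])]
            else:
                # Result is dead: only the flags matter, so drop the write.
                out[i:i + 2] = [("tst", first[2], first[3])]
            break
    return out
```

Fed the `and w0,w0,#1` / `cmp w0,#0` / `cset w0,ne` sequence with `dest_used_later=False`, the sketch yields `tst w0,#1` followed by `cset w0,ne`.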

System

Operating system: Linux (Raspberry Pi OS)
Processor architecture: Arm, AArch64
Device: Raspberry Pi 400

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Some conditional code is now more efficient.

Relevant logs and/or screenshots

In the cclasses unit under aarch64-linux (-O4) - before:

.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
	.balign 8
.globl	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
	.type	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
	...
	and	w0,w0,#1
	cmp	w0,#0
	cset	w0,ne
	b	.Lj1274

After:

.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
	.balign 8
.globl	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
	.type	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
	...
	tst	w0,#1
	cset	w0,ne
	b	.Lj1274

In the sysutils unit - before:

	...
.Lj163:
	...
	and	x1,x1,#1
	cbz	w1,.Lj162
.Lj166:
	...

After. In this situation the post-peephole optimisation takes a different route: originally the sequence was and/cmp/b.eq and was changed to and/cbz, but now it is tst/b.eq (with tst made by converting and/cmp) that is changed to tbz:

	...
.Lj163:
	...
	tbz	x1,#0,.Lj162
.Lj166:
	...
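
The TST/B.c to TBZ/TBNZ step only makes sense when the tested mask has exactly one bit set, since TBZ and TBNZ test a single bit. A rough Python sketch of that decision follows; `tst_branch_to_tbz` is a hypothetical helper for illustration and says nothing about how PostPeepholeOptTST is actually written.

```python
from typing import Optional, Tuple


def tst_branch_to_tbz(mask: int, cond: str, reg: str,
                      label: str) -> Optional[Tuple[str, str, str, str]]:
    """Model `tst reg,#mask; b.<cond> label` and return an equivalent
    tbz/tbnz tuple, or None when the pair cannot be folded."""
    if cond not in ("eq", "ne"):
        return None  # only zero / non-zero tests map onto tbz / tbnz
    if mask <= 0 or (mask & (mask - 1)) != 0:
        return None  # mask must have exactly one bit set
    bit = mask.bit_length() - 1  # index of the single set bit
    mnemonic = "tbz" if cond == "eq" else "tbnz"
    return (mnemonic, reg, f"#{bit}", label)
```

For the sysutils case above, a mask of 1 with an `eq` branch maps to bit 0 and `tbz`; a mask such as 63 is rejected because several bits are set.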

A common change that appears under aarch64-linux (-O4) in the classes unit - before:

	...
.Lj42:
	and	x0,x20,#63
	cbz	x0,.Lj46
	...

After:

	...
.Lj42:
	tst	x20,#63
	b.eq	.Lj46
	...

Though the instruction count is the same, it reduces the number of registers used.

The OptPass2TST optimisation is there to catch a couple of instances where, after an AND/CMP/B.c triplet is converted into a TST/B.c pair, an instruction identical to the original AND appears after the conditional jump, most likely with the same destination register as the one that was deallocated. In this instance, converting the TST into an ANDS removes an additional instruction. For the classes example - before:

	...
.Lj42:
	and	x0,x20,#63
	cbz	x0,.Lj46
	and	x0,x20,#63
	...

After (without OptPass2TST):

	...
.Lj42:
	tst	x20,#63
	b.eq	.Lj46
	and	x0,x20,#63
	...

After (with OptPass2TST):

	...
.Lj42:
	ands	x0,x20,#63
	b.eq	.Lj46
	...
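
The OptPass2TST idea can be sketched on the same kind of toy tuples. This is an illustrative Python model only: `merge_tst_with_later_and` and its `dest_is_free` parameter are hypothetical, and the check that the destination register is not in use is assumed to be done by the caller.

```python
from typing import List, Tuple

Op = Tuple[str, ...]  # (mnemonic, operand, ...), a toy instruction format


def merge_tst_with_later_and(code: List[Op], dest_is_free: bool) -> List[Op]:
    """Fold `tst rn,imm; b.c L; and rd,rn,imm` into `ands rd,rn,imm; b.c L`.

    dest_is_free states that rd is not in use across the branch; in this
    sketch the caller provides it rather than the function computing it.
    """
    out = list(code)
    for i in range(len(out) - 2):
        tst, branch, later = out[i], out[i + 1], out[i + 2]
        if (dest_is_free
                and len(tst) == 3 and tst[0] == "tst"
                and branch[0].startswith("b.")
                and len(later) == 4 and later[0] == "and"
                and later[2:] == tst[1:]):  # same source register and mask
            out[i:i + 3] = [("ands", later[1], later[2], later[3]), branch]
            break
    return out
```

Applied to the three-instruction "without OptPass2TST" listing above with `dest_is_free=True`, the sketch produces the two-instruction `ands` / `b.eq` form.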

The fpttfsubsetter unit has a similar example.

To show arm-linux some love - in the classes unit - before:

	...
.Lj137:
	ldr	r1,[r0, #12]
	ands	r1,r1,#31
	beq	.Lj142
	ldr	r2,[r0, #12]
	and	r1,r2,#31
	...

After:

	...
.Lj137:
	ldr	r1,[r0, #12]
	tst	r1,#31
	beq	.Lj142
	ldr	r2,[r0, #12]
	and	r1,r2,#31
	...

Turning ANDS into TST is a common change, and while it doesn't give a direct saving (unless you count the microwatts of power saved from not writing to a register), it can permit future optimisations. In this case, r1 will retain the value loaded from [r0, #12], so the second LDR instruction could be changed to mov r2,r1 and permit some register simplification (e.g. by changing what is now a MOV/AND combination to and r1,r1,#31).

Edited Nov 27, 2023 by J. Gareth "Kit" Moreton
