[ARM / AArch64] AND/CMP -> TST optimisation and supporting code

Summary

This merge request aims to optimise the intermediate assembly language produced under Arm and AArch64 by converting AND/CMP pairs into TST instructions and eliminating the temporary register used (the peephole optimizer attempts to remove the relevant allocation and deallocation entries), thereby reducing instruction count, increasing speed and permitting more future optimisations. It is split into the following parts:

  • Some minor refactoring and utility functions:
    • A general-purpose TryRemoveRegAlloc routine that attempts to remove the allocation and deallocation entries for a particular register between two instructions.
    • FindRegAllocBackward has been modified so that it returns nil if the first tai_regalloc it finds for the given register is a deallocation. Optimisations that use this function no longer need to check that the returned object is not a deallocation, which simplifies them and improves maintainability by eliminating bugs where this check might be forgotten.
    • The optimisation routines OptPostCMP and OptPostAnd for Arm and AArch64 have been renamed to PostPeepholeOptCMP and PostPeepholeOptAND for internal consistency.
  • A new optimisation that converts AND/CMP into either TST or ANDS instructions, depending on whether the temporary register is used afterwards (AND/CMP to ANDS is already performed by the OpCmp2OpS optimisation, but since all of the checks have already been performed by this point, it would waste time not to change it to ANDS here).
  • A new PostPeepholeOptTST routine has been introduced that aims to convert TST/B.c pairs into TBZ or TBNZ instructions. This is to compensate for the fact that AND/CMP pairs have been simplified and will no longer be picked up by PostPeepholeOptAND.
  • The PostPeepholeOptAND routine has been removed since it no longer catches the optimisation it was written for; PostPeepholeOptTST and the existing PostPeepholeOptCMP now fulfil these roles (see the previous point).
  • A new OptPass2TST routine aims to catch a less common case where AND/CMP is converted into TST but an identical AND appears after the conditional jump. A saving can then be made by converting the new TST instruction into ANDS, provided the register that the AND instruction writes to isn't in use.
  • RegInInstruction, RegModifiedByInstruction and RegLoadedWithNewValue have been upgraded on ARM and AArch64 to better handle the flags register.
  • OptPass2AND and OptPass2CMP will attempt to remove vestigial CMP and TST instructions (i.e. those that no longer have conditional instructions following them).
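
The changed FindRegAllocBackward contract from the list above can be illustrated with a toy model. This is only a sketch in Python, not the compiler's Pascal code: the RegAlloc and Instr classes, the list-based instruction stream and the function name find_reg_alloc_backward are all hypothetical stand-ins, and the real routine's scanning rules (what it skips and where it stops) are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class RegAlloc:
    """Hypothetical stand-in for a tai_regalloc entry."""
    reg: str
    is_alloc: bool  # True = allocation, False = deallocation


@dataclass
class Instr:
    """Hypothetical stand-in for an instruction entry."""
    text: str


Entry = Union[RegAlloc, Instr]


def find_reg_alloc_backward(entries: List[Entry], reg: str,
                            start: int) -> Optional[RegAlloc]:
    """Scan backward from just before index `start`.

    Returns the first RegAlloc entry found for `reg`, but only if it is
    an allocation. If the first entry found is a deallocation, None is
    returned, so callers do not have to test the entry kind themselves.
    """
    for i in range(start - 1, -1, -1):
        entry = entries[i]
        if isinstance(entry, RegAlloc) and entry.reg == reg:
            return entry if entry.is_alloc else None
    return None
```

With this contract a caller only needs a single nil check: a non-nil result is always an allocation.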

Additionally, due to a particular occurrence in the Classes unit, the new AND; CMP -> TST optimisation will look out for AND; CMP; B.c; AND combinations, where the second AND has the same inputs as the first AND. The destination register can be the same or different: if it is the same, the second AND is removed completely; if it is different, the second AND is changed to MOV. In both cases, the first AND is changed to ANDS.
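
The rewrite rules described above can be sketched as a toy peephole pass. This is a Python illustration under stated assumptions, not the Pascal implementation in the patch: instructions are modelled as plain tuples, the function name rewrite_and_cmp and the live_after parameter (the set of registers assumed to be still needed after the branch) are hypothetical, and only b.eq / b.ne consumers are matched. Note that the listings in this description show final assembly (where CBZ may appear), whereas the sketch works on the intermediate AND/CMP/B.c form.

```python
from typing import List, Set, Tuple

Instr = Tuple[str, ...]


def rewrite_and_cmp(instrs: List[Instr], live_after: Set[str]) -> List[Instr]:
    """Toy version of the AND; CMP #0; B.c rewrite.

    - If an AND with the same inputs follows the branch, the first AND
      becomes ANDS and the second AND is dropped (same destination) or
      turned into MOV (different destination).
    - Otherwise the pair becomes ANDS if the temporary is in `live_after`
      (a hypothetical liveness input), or TST if it is not.
    """
    out: List[Instr] = []
    i = 0
    while i < len(instrs):
        win = instrs[i:i + 3]
        if (len(win) == 3
                and win[0][0] == "and" and len(win[0]) == 4
                and win[1] == ("cmp", win[0][1], "#0")
                and win[2][0] in ("b.eq", "b.ne")):
            _, dst, src, mask = win[0]
            branch = win[2]
            nxt = instrs[i + 3] if i + 3 < len(instrs) else None
            # Second AND with identical inputs; require dst != src so the
            # inputs are known to be unchanged by the first AND.
            if (nxt is not None and len(nxt) == 4 and nxt[0] == "and"
                    and nxt[2:] == (src, mask) and dst != src):
                out.append(("ands", dst, src, mask))
                out.append(branch)
                if nxt[1] != dst:
                    out.append(("mov", nxt[1], dst))
                i += 4
                continue
            if dst in live_after:
                out.append(("ands", dst, src, mask))
            else:
                out.append(("tst", src, mask))
            out.append(branch)
            i += 3
            continue
        out.append(instrs[i])
        i += 1
    return out
```

For example, feeding it `and x0,x20,#63; cmp x0,#0; b.eq .Lj46; and x0,x20,#63` with an empty live_after set yields `ands x0,x20,#63; b.eq .Lj46`.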

System

  • Operating system: Linux (Raspberry Pi OS)
  • Processor architecture: Arm, AArch64
  • Device: Raspberry Pi 400

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Some conditional code is now more efficient.

Relevant logs and/or screenshots

In the cclasses unit under aarch64-linux (-O4) - before:

.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
	.balign 8
.globl	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
	.type	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
	...
	and	w0,w0,#1
	cmp	w0,#0
	cset	w0,ne
	b	.Lj1274

After:

.section .text.n_cclasses$_$tbitset_$__$$_isset$longint$$boolean,"ax"
	.balign 8
.globl	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN
	.type	CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN,@function
CCLASSES$_$TBITSET_$__$$_ISSET$LONGINT$$BOOLEAN:
.Lc804:
	...
	tst	w0,#1
	cset	w0,ne
	b	.Lj1274
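
Why the AND/CMP pair above can become a single TST can be checked with a small flag model. The sketch below is Python and only models 32-bit AArch64 flag results; the helper names are hypothetical, and it does not claim which condition codes the patch itself accepts. It shows that N and Z agree between the two sequences (so eq/ne consumers such as the cset above behave identically), while C differs, so a condition that reads the carry flag would not be interchangeable.

```python
MASK32 = 0xFFFFFFFF


def flags_and_cmp0(value: int, mask: int):
    """NZCV after `and tmp,value,#mask` followed by `cmp tmp,#0` (32-bit)."""
    tmp = value & mask & MASK32
    n = (tmp >> 31) & 1
    z = 1 if tmp == 0 else 0
    c = 1  # CMP is a subtraction; subtracting zero never borrows, so C is set
    v = 0  # subtracting zero cannot overflow
    return (n, z, c, v)


def flags_tst(value: int, mask: int):
    """NZCV after `tst value,#mask` on AArch64 (ANDS semantics: C = V = 0)."""
    res = value & mask & MASK32
    n = (res >> 31) & 1
    z = 1 if res == 0 else 0
    return (n, z, 0, 0)


def cond_ne(flags) -> bool:
    """NE holds when Z is clear."""
    return flags[1] == 0


def cond_hi(flags) -> bool:
    """HI holds when C is set and Z is clear."""
    return flags[2] == 1 and flags[1] == 0
```

In this model, `cond_ne` gives the same answer for both sequences on every input, whereas `cond_hi` can differ because TST leaves C clear.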

In the sysutils unit - before:

	...
.Lj163:
	...
	and	x1,x1,#1
	cbz	w1,.Lj162
.Lj166:
	...

After - in this situation, the post-peephole optimisation is different. Originally the sequence was and/cmp/b.eq and was changed to and/cbz, but now it is tst/b.eq (with the tst produced by converting and/cmp), which is changed to tbz:

	...
.Lj163:
	...
	tbz	x1,#0,.Lj162
.Lj166:
	...
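
The TST/B.c to TBZ/TBNZ conversion shown above only applies when the mask tests exactly one bit, which is why `#1` becomes `tbz ...,#0` here while a mask such as `#63` cannot. A minimal Python sketch of that check follows; the function names are hypothetical, the output is just formatted text, and the limited branch range of TBZ/TBNZ (about ±32 KiB) is ignored.

```python
from typing import Optional


def single_bit_index(mask: int) -> Optional[int]:
    """Return n if mask == 1 << n, otherwise None."""
    if mask > 0 and mask & (mask - 1) == 0:
        return mask.bit_length() - 1
    return None


def tst_branch_to_test_bit(reg: str, mask: int, cond: str,
                           label: str) -> Optional[str]:
    """Map `tst reg,#mask; b.<cond> label` to a TBZ/TBNZ line, if possible.

    eq (bit clear) maps to tbz and ne (bit set) maps to tbnz. Any other
    condition, or a mask with more than one bit set, yields None.
    """
    bit = single_bit_index(mask)
    if bit is None or cond not in ("eq", "ne"):
        return None
    op = "tbz" if cond == "eq" else "tbnz"
    return f"{op}\t{reg},#{bit},{label}"
```

Calling `tst_branch_to_test_bit("x1", 1, "eq", ".Lj162")` produces the tbz line from the listing above, and a mask of 63 returns None, leaving the tst/b.eq pair in place.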

A common change that appears under aarch64-linux (-O4) in the classes unit - before:

	...
.Lj42:
	and	x0,x20,#63
	cbz	x0,.Lj46
	...

After:

	...
.Lj42:
	tst	x20,#63
	b.eq	.Lj46
	...

Though the instruction count is the same, it reduces the number of registers used.

The OptPass2TST optimisation catches a couple of instances where, after an AND/CMP/B.c triplet has been converted into a TST/B.c pair, an instruction identical to the original AND appears after the conditional jump, most likely with the same destination register as the one that was deallocated. In this instance, converting the TST into an ANDS removes an additional instruction. For the classes example - before:

	...
.Lj42:
	and	x0,x20,#63
	cbz	x0,.Lj46
	and	x0,x20,#63
	...

After (without OptPass2TST):

	...
.Lj42:
	tst	x20,#63
	b.eq	.Lj46
	and	x0,x20,#63
	...

After (with OptPass2TST):

	...
.Lj42:
	ands	x0,x20,#63
	b.eq	.Lj46
	...

The fpttfsubsetter unit has a similar example.


To show arm-linux some love - in the classes unit - before:

	...
.Lj137:
	ldr	r1,[r0, #12]
	ands	r1,r1,#31
	beq	.Lj142
	ldr	r2,[r0, #12]
	and	r1,r2,#31
	...

After:

	...
.Lj137:
	ldr	r1,[r0, #12]
	tst	r1,#31
	beq	.Lj142
	ldr	r2,[r0, #12]
	and	r1,r2,#31
	...

Turning ANDS into TST is a common change, and while it doesn't give a direct saving (unless you count the microwatts of power saved from not writing to a register), it can permit future optimisations. In this case, r1 will retain the value referenced by [r0, #12], so the second LDR instruction could be changed to mov r2,r1 and permit some register simplification (e.g. by changing what is now a MOV/AND combination to and r1,r1,#31).

Edited by J. Gareth "Kit" Moreton

Activity

  • J. Gareth "Kit" Moreton changed title from And cmp ands optimisation to [ARM / AArch64] AND/CMP -> TST optimisation and supporting code

  • added 15 commits

    • e1ac8666...55e72fc0 - 9 commits from branch freepascal.org/fpc:main
    • 960fc07d - * New "TryRemoveRegAlloc" optimisation utility
    • c1d39d07 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 39993aef - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • b6038d43 - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • 3124a43c - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • ed39986b - * arm/a64: New "OptPass1Tst" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • J. Gareth "Kit" Moreton changed the description

  • J. Gareth "Kit" Moreton marked this merge request as draft

  • added 7 commits

    • 93820704 - 1 commit from branch freepascal.org/fpc:main
    • 77928ab4 - * New "TryRemoveRegAlloc" optimisation utility
    • b44f6d91 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • c915803d - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • 20522ecd - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • aa694b43 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • 68d9069b - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • J. Gareth "Kit" Moreton marked this merge request as ready

  • J. Gareth "Kit" Moreton resolved all threads

  • To give some extra love to Arm, the OptPass2TST optimisation can do some work.

    In the classes unit (arm-linux -O4) - before:

    	...
    .Lj49:
    	str	r6,[r4, #8]
    .Lj47:
    	ands	r0,r5,#31
    	beq	.Lj51
    	and	r0,r5,#31
    	mov	r1,#1
    	...

    The AND instruction gets removed since r0 already contains its calculation - after:

    	...
    .Lj49:
    	str	r6,[r4, #8]
    .Lj47:
    	ands	r0,r5,#31
    	beq	.Lj51
    	mov	r1,#1
    	...
  • added 1 commit

    • a9e9e786 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • added 7 commits

    • 44cda176 - 1 commit from branch freepascal.org/fpc:main
    • 3b5b539e - * New "TryRemoveRegAlloc" optimisation utility
    • 50f8db35 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 6d3a1a1a - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • c35900b8 - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • c219add2 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • 71294bcc - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • added 77 commits

    • 71294bcc...854d944c - 71 commits from branch freepascal.org/fpc:main
    • 300de14f - * New "TryRemoveRegAlloc" optimisation utility
    • 4c336dee - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • f66b3c61 - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • d4d33147 - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • 1932e886 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • d4b4f449 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • added 36 commits

    • d4b4f449...cc3f4508 - 30 commits from branch freepascal.org/fpc:main
    • 693f2d4a - * New "TryRemoveRegAlloc" optimisation utility
    • 06d4126a - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 4d70b0fc - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • 438129a6 - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • 9ae7ee25 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • c267c233 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • added 19 commits

    • c267c233...9f62b33e - 13 commits from branch freepascal.org/fpc:main
    • df749344 - * New "TryRemoveRegAlloc" optimisation utility
    • a451c760 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 68ecc175 - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • 5724d3bd - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • 0c589c94 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • a6c82c9a - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • added 14 commits

    • a6c82c9a...c9b88a1c - 8 commits from branch freepascal.org/fpc:main
    • 31b5dac1 - * New "TryRemoveRegAlloc" optimisation utility
    • 4e7ee1f7 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 3383e352 - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • 52209f6c - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • 9d8eb727 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • 0bbcd5c6 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • added 43 commits

    • 0bbcd5c6...363bc3e0 - 37 commits from branch freepascal.org/fpc:main
    • 35b6afec - * New "TryRemoveRegAlloc" optimisation utility
    • 58b936c3 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 341e0f17 - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • f62bf0f5 - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • f76c93f4 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • c4637da7 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

  • FPK added 10 commits

    • c4637da7...f3d93a47 - 4 commits from branch freepascal.org/fpc:main
    • e84a50a8 - * New "TryRemoveRegAlloc" optimisation utility
    • 38eabb91 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • 8b97323d - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • 89eea0f7 - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • db831396 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • a2bc5722 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation

    • Owner
      Resolved by J. Gareth "Kit" Moreton

      I get a lot of new failures on arm-linux after applying it?

      0a1,2
      > ../packages/fcl-json/tests/testjson
      > ../packages/pastojs/tests/testpas2js
      1a4,7
      > test/cg/taddcurr
      > test/cg/taddreal1
      > test/cg/taddreal2
      > test/cg/taddreal3
      6d11
      < test/tcpstr27
      7a13
      > test/tfma1arm
      11a18
      > test/toperator1
      18a26,29
      > test/units/system/tstr1
      > test/units/sysutils/tfloattostr
      > tbs/tb0005
      > tbs/tb0599
      19a31
      > tbs/tb0683
      22c34,37
      < webtbs/tw12993
      ---
      > webtbs/tw10791
      > webtbs/tw12894
      > webtbs/tw13552
      > webtbs/tw16040
      23a39,40
      > webtbs/tw26993
      > webtbs/tw26993a
      25a43,46
      > webtbs/tw35626
      > webtbs/tw37397
      > webtbs/tw38129
      > webtbs/tw3833
      26a48,49
      > webtbs/tw40041
      > webtbs/tw8148
  • added 33 commits

    • a2bc5722...39d2035d - 24 commits from branch freepascal.org/fpc:main
    • b968606f - * New "TryRemoveRegAlloc" optimisation utility
    • bd672cb5 - * FindRegAllocBackward will now return nil if it hits a dealloc for the register first
    • d3538918 - * a64: Renamed OptPostCMP/And to PostPeepholeOptCMP/AND for internal consistency
    • e40b9a26 - * arm: Fixed "RegInInstruction" and "RegModifiedByInstruction" not handling the flags properly
    • e302dfef - * arm/a64: New AND/CMP -> TST or ANDS optimisation
    • 159aa985 - * arm/a64: New "OptPass2TST" routine to catch "TST; B.c; AND -> ANDS; B.c" optimisation
    • 2fd0e448 - * arm/a64: Added new TST post-peephole optimisation to replace previous AND/CMP/B(c) optimisation
    • 9835301b - * arm: "OpCmp2OpS" moved to Pass 2 so it doesn't conflict with AND; CMP -> TST optimisation
    • 8c9deeba - * arm/a64: "OptPass2AND" and "OptPass2CMP" adapted to remove vestigial CMP and TST instructions

  • J. Gareth "Kit" Moreton marked this merge request as draft
