Skip to content

[AArch64] CSEL (conditional MOV) generation

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:b-csel into main

Summary

This merge request, based on the x86 CMOV overhaul over at !532 (merged), upgrades the AArch64 peephole optimizer to generate CSEL instructions to reduce branching and generate much more efficient code (which is also generally smaller as well - for comparison, the aarch64-linux compiler built under "-O4 -dDEBUG_AOPTCPU" is 4,507,976 bytes in size on the trunk, and 4,487,752 bytes with this patch... that's considering the larger code base that comes with this new optimisation).

System

  • Operating system: Linux (Raspberry Pi OS) and others
  • Processor architecture: AArch64
  • Device: Raspberry Pi and others

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Many conditional branches in the generated assembly language will be made shorter and faster via the use of CSEL instructions instead of branches.

Relevant logs and/or screenshots

Under aarch64-linux -O4, even small units receive a gain, and it's probably easier to list the units that don't get changed!

To list some simple examples in the math unit, let us begin with the 64-bit Max function (the 32-bit version seems to have a CSEL function generated at a higher stage... something to research at a later date) - before:

.section .text.n_math_$$_min$int64$int64$$int64,"ax"
	.balign 8
.globl	MATH_$$_MIN$INT64$INT64$$INT64
	.type	MATH_$$_MIN$INT64$INT64$$INT64,@function
MATH_$$_MIN$INT64$INT64$$INT64:
.Lc618:
	cmp	x0,x1
	b.lt	.Lj693
	mov	x0,x1
.Lj693:
	ret
.Lc617:
.Le178:
	.size	MATH_$$_MIN$INT64$INT64$$INT64, .Le178 - MATH_$$_MIN$INT64$INT64$$INT64

After:

.section .text.n_math_$$_min$int64$int64$$int64,"ax"
	.balign 8
.globl	MATH_$$_MIN$INT64$INT64$$INT64
	.type	MATH_$$_MIN$INT64$INT64$$INT64,@function
MATH_$$_MIN$INT64$INT64$$INT64:
.Lc618:
	cmp	x0,x1
// Peephole Optimization: CSEL Block (Simple type)
	csel	x0,x1,x0,ge
	ret
.Lc617:
.Le178:
	.size	MATH_$$_MIN$INT64$INT64$$INT64, .Le178 - MATH_$$_MIN$INT64$INT64$$INT64

An example where none of the CSEL's inputs equal the output is the IfThen function in the Math unit, which simulates C's ternary operator with a Boolean input and LongInt output - before:

.section .text.n_math_$$_ifthen$boolean$longint$longint$$longint,"ax"
	.balign 8
.globl	MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT
	.type	MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT,@function
MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT:
.Lc704:
	uxtb	w3,w0
	mov	w0,w1
// Peephole Optimization: CMPB.E/NE2CBNZ/CBZ done
	cbnz	w3,.Lj821
	mov	w0,w2
.Lj821:
	ret
.Lc703:
.Le212:
	.size	MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT, .Le212 - MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT

After:

.section .text.n_math_$$_ifthen$boolean$longint$longint$$longint,"ax"
	.balign 8
.globl	MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT
	.type	MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT,@function
MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT:
.Lc704:
	uxtb	w3,w0
// Peephole Optimization: MovCSel2CSel
	cmp	w3,#0
// Peephole Optimization: CSEL Block (Simple type)
	csel	w0,w2,w1,eq
	ret
.Lc703:
.Le212:
	.size	MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT, .Le212 - MATH_$$_IFTHEN$BOOLEAN$LONGINT$LONGINT$$LONGINT

In aasmcnst, a "double branching" to the same destination label is demonstrated, which eliminates a label and some instructions - before:

	...
.Lj361:
// Peephole Optimization: CMPB.E/NE2CBNZ/CBZ done
	cbz	w20,.Lj363
// Peephole Optimization: Movz0ToMovZeroReg
	mov	x0,xzr
	b	.Lj357
.Lj363:
	mov	x0,x1
	b	.Lj357
.Lj358:
// Peephole Optimization: Movz0ToMovZeroReg
	mov	x0,xzr
.Lj357:
	...

After:

	...
.Lj361:
	cmp	w20,#0
// Peephole Optimization: CSEL Block (Double branching (same) type)
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: MovCSel2CSel
	csel	x0,x1,xzr,eq
	b	.Lj357
.Lj358:
// Peephole Optimization: Movz0ToMovZeroReg
	mov	x0,xzr
.Lj357:
	...

The classes unit shows an example where more than one register is set this way - before:

	...
.section .text.n_classes$_$tbits_$__$$_equals$tbits$$boolean,"ax"
	cmp	x2,x3
	b.ge	.Lj183
	mov	x19,x0
	mov	x20,x1
	b	.Lj184
.Lj183:
	mov	x19,x1
	mov	x20,x0
.Lj184:
	...

After:

.section .text.n_classes$_$tbits_$__$$_equals$tbits$$boolean,"ax"
	...
	cmp	x2,x3
// Peephole Optimization: CSEL Block (Double type)
// Peephole Optimization: MovCSel2CSel
// Peephole Optimization: MovCSel2CSel
	csel	x19,x1,x0,ge
	csel	x20,x0,x1,ge
	...

Also in the classes unit, an example exists where a register is reserved to store a constant - before:

.section .text.n_classes$_$tstream_$__$$_read64$tbytes$int64$int64$$int64,"ax"
	...
.Lj222:
	sub	x22,x21,x23
// Peephole DataMov2Data removed superfluous mov
	mov	x0,#2147483647
	cmp	x22,x0
	b.le	.Lj226
	mov	x22,#2147483647
.Lj226:
	...

After - this can be improved if it can be detected that x0 already contains the desired value:

.section .text.n_classes$_$tstream_$__$$_read64$tbytes$int64$int64$$int64,"ax"
	...
	mov	x0,#2147483647
	mov	x1,#2147483647
	cmp	x22,x0
// Peephole Optimization: CSEL Block (Simple type)
	csel	x22,x1,x22,gt
	...

Also in the classes unit, a more long-range optimisation occurs via the new MovCSel2CSel optimisation - before (unrelated debug messages removed for clarity):

.section .text.n_classes$_$tstream_$__$$_seek$longint$word$$longint,"ax"
	...
	mov	x4,x19
	ldr	x5,[sp]
	ldr	x0,[x5, #280]
	mov	x6,sp
	mov	x5,x3
	cmp	x0,x5
	b.ne	.Lj264
	mov	x4,xzr
	mov	x3,xzr
.Lj264:
	...

After - x19 doesn't change value between the MOV and the conditional block, so it can be removed and the CSEL instruction that was csel x4,xzr,x4,eq changed to csel x4,xzr,x19,eq:

.section .text.n_classes$_$tstream_$__$$_seek$longint$word$$longint,"ax"
	...
// Peephole Optimization: MovCSel2CSel
	ldr	x5,[sp]
	ldr	x0,[x5, #280]
	mov	x6,sp
	mov	x5,x3
	cmp	x0,x5
// Peephole Optimization: CSEL Block (Simple type)
	csel	x4,xzr,x19,eq
	csel	x3,xzr,x3,eq
.Lj264:
	...

Future development

  • Branches that use MOVZ/MOVN/MOVK to set constants are currently not optimised. This will be subject to future development.
  • Currently, MOV instructions that are created to store simple constants are placed after CMP instructions and are not relocated to precede them (assuming the MOV instruction doesn't write to the register being compared). In x86, this is achieved via the TrySwapMovCmp optimisation routine, but this depends on x86-specific functionality, specifically InsProp in TrySwapMovOp. An AArch64 version will be developed at a later date.
  • The CSel2Mov optimisation may be deprecated at a later date since it optimises situations where a CSEL instruction is generated with the same true/false register, and is a sign of inefficient user code that can possibly be simplified at a higher level. This is demonstrated in the aasmcpu unit - before:
	...
.Lj511:
// Peephole Optimization: CMPB.E/NE2CBNZ/CBZ done
	cbnz	w1,.Lj516
// Peephole Optimization: Movz0ToMovZeroReg
	mov	w0,wzr
	b	.Lj508
.Lj516:
// Peephole Optimization: Movz0ToMovZeroReg
	mov	w0,wzr
	b	.Lj508
.Lj512:
	mov	w2,w1
	cmp	w2,#2
	b.hs	.Lj519
// Peephole Optimization: Movz0ToMovZeroReg
	mov	w0,wzr
	b	.Lj508
.Lj519:
// Peephole Optimization: Movz0ToMovZeroReg
	mov	w0,wzr
	b	.Lj508
.Lj513:
	...

After:

	...
.Lj511:
	cmp	w1,#0
// Peephole Optimization: CSEL Block (Double branching (same) type)
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: MovCSel2CSel
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: CSel2Mov (identical true/false registers)
	mov	w0,wzr
	b	.Lj508
.Lj512:
	mov	w2,w1
	cmp	w2,#2
// Peephole Optimization: CSEL Block (Double branching (same) type)
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: MovCSel2CSel
// Peephole Optimization: Movz0ToMovZeroReg
// Peephole Optimization: CSel2Mov (identical true/false registers)
	mov	w0,wzr
	b	.Lj508
.Lj513:
	...

Merge request reports