
[ARM/AArch64] Single-instruction MOV sanitation

Summary

This merge request has two distinct optimisations that convert lone instructions into MOV equivalents:

  • movz reg,#0, when by itself (not part of a movz/movk pair, for example), is changed into mov reg,xzr (or mov reg,wzr, depending on the register size).
  • sbfx and ubfx instructions get converted to mov instructions if the lsb and width fields cover the entire register (performed in the pre-peephole stage).

(It also adjusts a debug message slightly so that it follows the standard.)
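Why the bitfield conversion is safe can be checked with a small model. The following is a minimal Python sketch of the architectural behaviour of ubfx and sbfx, not the compiler's own code; the function names and the `reg_bits` parameter are made up for illustration. It shows that a field starting at bit 0 and spanning the whole register returns the source value unchanged, which is exactly what mov does.

```python
def ubfx(value, lsb, width, reg_bits=64):
    """Unsigned bitfield extract: take `width` bits from `lsb`, zero-extend."""
    assert 0 <= lsb < reg_bits and 1 <= width <= reg_bits - lsb
    return (value >> lsb) & ((1 << width) - 1)


def sbfx(value, lsb, width, reg_bits=64):
    """Signed bitfield extract: as ubfx, then sign-extend to the register."""
    field = ubfx(value, lsb, width, reg_bits)
    sign_bit = 1 << (width - 1)
    # Sign-extend the field, then wrap it back into an unsigned register value.
    return ((field ^ sign_bit) - sign_bit) & ((1 << reg_bits) - 1)


def mov(value, reg_bits=64):
    """Plain register copy, truncated to the register width."""
    return value & ((1 << reg_bits) - 1)


if __name__ == "__main__":
    samples = [0, 1, 0x7FFF_FFFF_FFFF_FFFF, 0x8000_0000_0000_0000,
               0xFFFF_FFFF_FFFF_FFFF, 0x1234_5678_9ABC_DEF0]
    for v in samples:
        # lsb = 0 and width = 64 cover the entire x register.
        assert ubfx(v, 0, 64) == mov(v)
        assert sbfx(v, 0, 64) == mov(v)
    print("full-width ubfx/sbfx behave like mov")
```

A narrower field is not a plain copy (for example, `sbfx` of bits 8..15 sign-extends), which is why the conversion only applies when lsb and width cover the entire register.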

System

  • Operating system: Linux (Raspberry Pi OS) and others
  • Processor architecture: ARM, AArch64
  • Device: Raspberry Pi 400 (and others)

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

The two conversions do not improve the code on their own, but they sometimes start an optimisation cascade in which later peephole passes make further improvements.

Relevant logs and/or screenshots

Though the optimisations themselves don't reduce instruction size or increase speed, they do sometimes open up other optimisations. One striking example can be found in the TStream.Seek method of the Classes unit. Before:

	...
.Lj308:
	...
	mov	x4,x20
	mov	x0,sp
	ldr	x5,[sp]
	ldr	x5,[x5, #280]
	mov	x6,sp
	ubfx	x0,x5,#0,#64
	ubfx	x5,x3,#0,#64
	cmp	x0,x5
	b.ne	.Lj314
	movz	x5,#0
	movz	x0,#0
	mov	x3,x0
	mov	x4,x5
.Lj314:
	ubfx	x0,x3,#0,#64
	...

After:

	...
.Lj308:
	...
	mov	x4,x20
	ldr	x5,[sp]
	ldr	x0,[x5, #280]
	mov	x6,sp
	mov	x5,x3
	cmp	x0,x5
	b.ne	.Lj314
	mov	x4,xzr
	mov	x3,xzr
.Lj314:
	mov	x0,x3
	...

Converting the first two ubfx instructions to mov opened up a simplification of the preceding load: ldr x5,[x5, #280] became ldr x0,[x5, #280], and what had become mov x0,x5 was removed completely, since x5 gets overwritten on the very next instruction. (The cmp instruction could be changed to cmp x0,x3, possibly allowing x5 to be removed entirely, but we'll deal with that another day.)
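The two single-instruction rewrites that start this cascade can be illustrated with a toy pass over textual assembly. This is only a sketch under stated assumptions: the real optimiser works on the compiler's internal instruction list rather than on strings, and the names `parse` and `sanitise`, as well as the one-instruction lookahead for a following movk, are hypothetical simplifications.

```python
import re

# Hypothetical textual model: mnemonic, whitespace, comma-separated operands.
INSTR = re.compile(r"^\s*(\w+)\s+(.*)$")


def parse(line):
    """Split a line into (mnemonic, operands); labels and '...' give (None, [])."""
    m = INSTR.match(line)
    if not m:
        return None, []
    return m.group(1), [op.strip() for op in m.group(2).split(",")]


def sanitise(lines):
    """Rewrite full-width ubfx/sbfx and lone 'movz reg,#0' into mov."""
    out = []
    for i, line in enumerate(lines):
        op, args = parse(line)
        if op in ("ubfx", "sbfx") and len(args) == 4:
            dst, src, lsb, width = args
            bits = 64 if dst.startswith("x") else 32
            if lsb == "#0" and width == f"#{bits}":
                out.append(f"\tmov\t{dst},{src}")
                continue
        if op == "movz" and args[1:] == ["#0"]:
            nxt_op, nxt_args = parse(lines[i + 1]) if i + 1 < len(lines) else (None, [])
            # Assumption: only the next instruction is checked for a paired movk.
            if not (nxt_op == "movk" and nxt_args[:1] == args[:1]):
                zero = "xzr" if args[0].startswith("x") else "wzr"
                out.append(f"\tmov\t{args[0]},{zero}")
                continue
        out.append(line)
    return out


if __name__ == "__main__":
    before = ["\tubfx\tx0,x5,#0,#64", "\tmovz\tx5,#0", ".Lj314:"]
    print("\n".join(sanitise(before)))
```

Once everything is expressed as register-to-register moves, the existing mov-tracking optimisations can see through the copies, which is what allows the load and the zeroing sequence in the listing above to be simplified.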

Meanwhile, changing movz reg,#0 to mov reg,xzr permitted an additional optimisation: x4 is set to x5 and x3 is set to x0, after which x0 and x5 are discarded, so the zero register can be written to x4 and x3 directly. It's generally easier to track matching registers than immediate values. There is future potential to optimise this further, from:

	...
	b.ne	.Lj314
	mov	x4,xzr
	mov	x3,xzr
.Lj314:
	...

To:

	...
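	csel x4,xzr,x4,eq
	csel x3,xzr,x3,eq
	...

The condition is eq because the original b.ne skips the two moves, so the registers are only cleared when the comparison finds the operands equal. Earlier in the code, x4 is set to x20, so this can be improved further:

	...
	csel x4,xzr,x20,eq
	csel x3,xzr,x3,eq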
	...
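
The proposed branch-free form can be sanity-checked with a small model. The sketch below is a hypothetical Python rendering of the two sequences, not compiler code; `csel(cond, rn, rm)` follows the architectural definition of returning rn when the condition holds and rm otherwise. Because b.ne jumps over the two moves, the registers are zeroed only on equality, so the matching select has to pick xzr on the eq condition.

```python
def branch_version(x0, x5, x3, x4):
    """cmp x0,x5 ; b.ne .Lj314 ; mov x4,xzr ; mov x3,xzr ; .Lj314:"""
    if x0 != x5:
        # b.ne taken: both moves are skipped, registers keep their values.
        return x3, x4
    return 0, 0


def csel(cond, rn, rm):
    """Conditional select: rn if the condition holds, else rm."""
    return rn if cond else rm


def csel_version(x0, x5, x3, x4):
    """cmp x0,x5 ; csel x4,xzr,x4,eq ; csel x3,xzr,x3,eq"""
    eq = (x0 == x5)
    x4 = csel(eq, 0, x4)
    x3 = csel(eq, 0, x3)
    return x3, x4


if __name__ == "__main__":
    for x0, x5 in [(0, 0), (1, 1), (1, 2), (5, 0)]:
        assert branch_version(x0, x5, 7, 9) == csel_version(x0, x5, 7, 9)
    print("branch and csel sequences agree")
```

The same result could be written with the operands swapped and the ne condition (selecting the old value when not equal); which spelling the optimiser would emit is left open here.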
Edited by J. Gareth "Kit" Moreton
