
[AArch64] MOV(Z) reg1,const / op reg2,reg1 Deep Optimisation


This merge request marks the beginning of some deep peephole optimisations that target code where a MOV reg,const (or MOVZ) instruction exists and its destination register is then used in an arithmetic operation. They check whether the constant can instead be encoded directly in that instruction. A new helper function, is_arith_const, evaluates whether the constant can be encoded in an ADD, SUB, CMN or CMP instruction: the numbers 0 to 4095, and 4096 to 16773120 in steps of 4096 (a 12-bit immediate, optionally shifted left by 12 bits).
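The range test described above can be sketched in Python as follows. This is a minimal illustration written from the ranges stated in this merge request, not the compiler's actual Pascal implementation; it only checks the unsigned ranges listed above.

```python
def is_arith_const(value: int) -> bool:
    """True if value fits the ADD/SUB/CMN/CMP immediate field.

    The field is a 12-bit unsigned value (0..4095), optionally shifted left
    by 12 bits, which adds the multiples of 4096 from 4096 to 4095 * 4096.
    """
    if 0 <= value <= 0xFFF:
        return True  # unshifted form
    # shifted form: low 12 bits clear, remaining bits fit in 12 bits
    return (value & 0xFFF) == 0 and 0 < (value >> 12) <= 0xFFF


print(is_arith_const(32))        # True
print(is_arith_const(4097))      # False
print(is_arith_const(16773120))  # True  (4095 * 4096)
print(is_arith_const(16777216))  # False (4096 * 4096)
```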

Other instructions, like logical instructions, will be added at a later date, as will optimisations that deal with MOV reg1,reg2 instructions.

Additionally, a node-level optimisation has been implemented with min/max nodes that makes use of is_arith_const and attempts to directly create CMP reg,const instructions if the second operand is a constant, thus taking some strain off the peephole optimizer at low optimisation settings and potentially reducing the number of required iterations at higher optimisation settings.
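As a rough illustration of the node-level idea, the sketch below generates text assembly for a signed max(src, const) on 32-bit registers. emit_max_const and its parameters are hypothetical, and the constant is assumed to be loadable with a single MOV. Because CSEL only accepts register operands, the constant still has to be placed in a register; only the CMP can use the immediate form, and only when is_arith_const allows it.

```python
from typing import List


def is_arith_const(value: int) -> bool:
    # 12-bit unsigned immediate, optionally shifted left by 12
    if 0 <= value <= 0xFFF:
        return True
    return (value & 0xFFF) == 0 and 0 < (value >> 12) <= 0xFFF


def emit_max_const(dst: str, src: str, tmp: str, const: int) -> List[str]:
    """Hypothetical code generator for the signed max dst := max(src, const)."""
    # CSEL only takes registers, so the constant is still loaded into tmp;
    # assumes a single MOV can load it (zero is copied from wzr)
    source = "wzr" if const == 0 else f"#{const}"
    code = [f"mov\t{tmp},{source}"]
    if is_arith_const(const):
        code.append(f"cmp\t{src},#{const}")  # immediate form: CMP does not read tmp
    else:
        code.append(f"cmp\t{src},{tmp}")     # fall back to the register form
    code.append(f"csel\t{dst},{src},{tmp},gt")  # src if src > const, else const
    return code


print("\n".join(emit_max_const("w0", "w0", "w1", 0)))
```

With const = 0 the output mirrors the mov/cmp/csel sequence in the cgbase example below; with a constant such as 5000, which is not encodable, the CMP falls back to comparing against the register.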


  • Operating system: Linux (Raspberry Pi OS) and others
  • Processor architecture: AArch64
  • Device: Raspberry Pi and others

What is the current bug behavior?


What is the behavior after applying this patch?

Code is made smaller and faster in situations where a constant is moved into a register and then used by an arithmetic or comparison instruction.

Relevant logs and/or screenshots

To start with a simple case, in the cgbase unit (aarch64-linux, -O4) - before:

.section .text.n_cgbase_$$_initmms$pmmshuffle$shortint,"ax"
	stp	x19,x20,[sp, #-16]!
	mov	x19,x0
	sxtb	w20,w1
	mov	w0,w20
	mov	w1,wzr
	cmp	w0,w1
	csel	w0,w0,w1,gt

After - the CMP instruction changes from cmp w0,w1 to cmp w0,#0 since w1 is known to be zero (the MOV that sets it was changed to use wzr by the peephole optimizer elsewhere). The MOV itself is kept because the CSEL still reads w1:

.section .text.n_cgbase_$$_initmms$pmmshuffle$shortint,"ax"
	stp	x19,x20,[sp, #-16]!
	mov	x19,x0
	sxtb	w20,w1
	mov	w0,w20
	mov	w1,wzr
	cmp	w0,#0
	csel	w0,w0,w1,gt

Future optimisations will deal with the fact that w0 = w20 and w1 = wzr throughout much of the block.
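The CMP rewrite in the cgbase example can be modelled as a small pass over a single basic block. In this sketch, instructions are hypothetical Python lists of strings, known_const and fold_cmp_consts are illustrative names, and W/X register aliasing, labels and branches are ignored. The pass only changes the CMP; it leaves the MOV in place, matching the listing above.

```python
from typing import List, Optional

Insn = List[str]  # hypothetical form: mnemonic followed by operands

# Mnemonics whose first operand is read rather than written (simplified)
NO_DEST = {"cmp", "cmn", "str", "stp", "b"}


def is_arith_const(value: int) -> bool:
    # 12-bit unsigned immediate, optionally shifted left by 12
    if 0 <= value <= 0xFFF:
        return True
    return (value & 0xFFF) == 0 and 0 < (value >> 12) <= 0xFFF


def known_const(code: List[Insn], index: int, reg: str) -> Optional[int]:
    """Value of reg just before code[index] if its last write was a constant MOV."""
    for insn in reversed(code[:index]):
        if insn[0] in NO_DEST or len(insn) < 3 or insn[1] != reg:
            continue
        if insn[0] in ("mov", "movz") and insn[2] == "wzr":
            return 0
        if insn[0] in ("mov", "movz") and insn[2].startswith("#"):
            return int(insn[2][1:], 0)
        return None  # last write was not a simple constant load
    return None


def fold_cmp_consts(code: List[Insn]) -> List[Insn]:
    """cmp/cmn rN,rM -> cmp/cmn rN,#c when rM holds an encodable constant c."""
    out = [list(insn) for insn in code]
    for i, insn in enumerate(out):
        if insn[0] in ("cmp", "cmn") and not insn[2].startswith("#"):
            value = known_const(out, i, insn[2])
            if value is not None and is_arith_const(value):
                insn[2] = f"#{value}"
    return out


before = [["sxtb", "w20", "w1"],
          ["mov", "w0", "w20"],
          ["mov", "w1", "wzr"],
          ["cmp", "w0", "w1"],
          ["csel", "w0", "w0", "w1", "gt"]]
print(fold_cmp_consts(before)[3])  # ['cmp', 'w0', '#0']
```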

In the regexpr unit - before:

.section .text.n_regexpr_$$__uppercase$ansichar$$ansichar,"ax"
	movz	w0,#32
	sub	w20,w20,w0
	uxtb	w20,w20
	b	.Lj64

After - the number 32 can be encoded directly as a SUB immediate, and since w0 is deallocated after this instruction, the original MOVZ is removed as well:

.section .text.n_regexpr_$$__uppercase$ansichar$$ansichar,"ax"
	sub	w20,w20,#32
	uxtb	w20,w20
	b	.Lj64
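The regexpr case adds a second step: once the constant is folded into the SUB, the MOVZ becomes dead if its register is released immediately afterwards. In this sketch, a ["dealloc", reg] pseudo-instruction stands in for the compiler's register deallocation information; that marker, fold_movz_into_arith and the list-of-strings representation are all hypothetical. Only the second source operand of ADD/SUB is considered.

```python
from typing import List

Insn = List[str]  # hypothetical form: mnemonic followed by operands


def is_arith_const(value: int) -> bool:
    # 12-bit unsigned immediate, optionally shifted left by 12
    if 0 <= value <= 0xFFF:
        return True
    return (value & 0xFFF) == 0 and 0 < (value >> 12) <= 0xFFF


def fold_movz_into_arith(code: List[Insn]) -> List[Insn]:
    """mov/movz rA,#c ; add/sub rD,rN,rA  ->  add/sub rD,rN,#c when c is encodable.

    The MOV is deleted as well when a ["dealloc", rA] marker follows the
    ADD/SUB, i.e. rA is released there (stand-in for register allocation info).
    """
    out = [list(insn) for insn in code]
    i = 0
    while i + 1 < len(out):
        mov, user = out[i], out[i + 1]
        # user[2] != mov[1]: if rA were also the first source, the MOV stays needed
        if (mov[0] in ("mov", "movz") and len(mov) == 3 and mov[2].startswith("#")
                and user[0] in ("add", "sub") and len(user) == 4
                and user[3] == mov[1] and user[2] != mov[1]):
            value = int(mov[2][1:], 0)
            if is_arith_const(value):
                user[3] = f"#{value}"
                if i + 2 < len(out) and out[i + 2] == ["dealloc", mov[1]]:
                    del out[i + 2]  # the release marker
                    del out[i]      # the now-unused MOV
                    continue
        i += 1
    return out


before = [["movz", "w0", "#32"],
          ["sub", "w20", "w20", "w0"],
          ["dealloc", "w0"],
          ["uxtb", "w20", "w20"],
          ["b", ".Lj64"]]
print(fold_movz_into_arith(before))
# [['sub', 'w20', 'w20', '#32'], ['uxtb', 'w20', 'w20'], ['b', '.Lj64']]
```

If no release marker follows, the sketch still folds the constant but keeps the MOV, because the register may be read later.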

In the uhpackimp unit, multiple supporting peephole optimisations come into play - before:

.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
	str	wzr,[x19, #36]
	mov	w1,wzr
	add	w0,w0,w1
	str	w0,[x19, #32]

Intermediate - since wzr is treated as 0 by the new optimisation, wzr replaces w1 in the ADD instruction, and the original MOV is then removed because w1 is deallocated:

.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
	str	wzr,[x19, #36]
	add	w0,w0,wzr
	str	w0,[x19, #32]
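In this step the zero comes from wzr rather than from an immediate, so it can be substituted as the register wzr itself: the register form of ADD/SUB reads wzr as zero in the second source position. The sketch below uses a hypothetical list-of-strings representation, a hypothetical function name (fold_wzr_into_arith) and a ["dealloc", reg] pseudo-instruction standing in for the compiler's register deallocation information.

```python
from typing import List

Insn = List[str]  # hypothetical form: mnemonic followed by operands


def fold_wzr_into_arith(code: List[Insn]) -> List[Insn]:
    """mov rA,wzr ; add/sub rD,rN,rA  ->  add/sub rD,rN,wzr.

    The MOV is deleted too when a ["dealloc", rA] marker (stand-in for
    register allocation info) follows the ADD/SUB.
    """
    out = [list(insn) for insn in code]
    i = 0
    while i + 1 < len(out):
        mov, user = out[i], out[i + 1]
        if (mov[0] == "mov" and len(mov) == 3 and mov[2] == "wzr"
                and user[0] in ("add", "sub") and len(user) == 4
                and user[3] == mov[1] and user[2] != mov[1]):
            user[3] = "wzr"
            if i + 2 < len(out) and out[i + 2] == ["dealloc", mov[1]]:
                del out[i + 2]  # the release marker
                del out[i]      # the now-unused MOV
                continue
        i += 1
    return out


before = [["str", "wzr", "[x19, #36]"],
          ["mov", "w1", "wzr"],
          ["add", "w0", "w0", "w1"],
          ["dealloc", "w1"],
          ["str", "w0", "[x19, #32]"]]
print(fold_wzr_into_arith(before))
# [['str', 'wzr', '[x19, #36]'], ['add', 'w0', 'w0', 'wzr'], ['str', 'w0', '[x19, #32]']]
```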

After - an additional new peephole optimisation removes add reg,reg,#0/wzr and sub reg,reg,#0/wzr instructions since they are identity operations, and then the two STR instructions are merged into an STP instruction since they write to contiguous memory:

.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
	stp	w0,wzr,[x19, #32]
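Both follow-up rules can be sketched as one small Python pass over a hypothetical list-of-strings representation: first drop ADD/SUB identity operations, then merge two adjacent 32-bit STRs with the same base register into an STP. The function names are illustrative only. The offset limits come from the 32-bit STP encoding (a signed 7-bit immediate scaled by 4, so -256 to 252); the sketch assumes the two stores end up next to each other once the identity instruction is gone, as in this example.

```python
import re
from typing import List, Optional

Insn = List[str]  # hypothetical form: mnemonic followed by operands

# Matches a base-plus-immediate address such as "[x19, #32]"
MEM = re.compile(r"\[(\w+), #(-?\d+)\]$")


def is_identity(insn: Insn) -> bool:
    """add/sub rd,rd,#0 or add/sub rd,rd,wzr: the 32-bit result equals rd's old value."""
    return (insn[0] in ("add", "sub") and len(insn) == 4
            and insn[1] == insn[2] and insn[3] in ("#0", "wzr"))


def merge_str_pair(a: Insn, b: Insn) -> Optional[Insn]:
    """Return an STP replacing two 32-bit STRs to adjacent words, else None."""
    if a[0] != "str" or b[0] != "str":
        return None
    if not (a[1].startswith("w") and b[1].startswith("w")):
        return None  # only 32-bit registers are handled here
    ma, mb = MEM.match(a[2]), MEM.match(b[2])
    if not ma or not mb or ma.group(1) != mb.group(1):
        return None  # need the same base register
    (lo, r_lo), (hi, r_hi) = sorted([(int(ma.group(2)), a[1]),
                                     (int(mb.group(2)), b[1])])
    # 32-bit STP: words at lo and lo+4, lo a multiple of 4 in -256..252
    if hi - lo != 4 or lo % 4 != 0 or not -256 <= lo <= 252:
        return None
    return ["stp", r_lo, r_hi, f"[{ma.group(1)}, #{lo}]"]


def peephole(code: List[Insn]) -> List[Insn]:
    """Drop identity ADD/SUBs, then merge adjacent STR pairs into STP."""
    out = [insn for insn in code if not is_identity(insn)]
    merged: List[Insn] = []
    i = 0
    while i < len(out):
        pair = merge_str_pair(out[i], out[i + 1]) if i + 1 < len(out) else None
        if pair is not None:
            merged.append(pair)
            i += 2
        else:
            merged.append(out[i])
            i += 1
    return merged


print(peephole([["str", "wzr", "[x19, #36]"],
                ["add", "w0", "w0", "wzr"],
                ["str", "w0", "[x19, #32]"]]))
# [['stp', 'w0', 'wzr', '[x19, #32]']]
```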
