
[AArch64] MOV(Z) reg1,const / op reg2,reg1 Deep Optimisation

Summary

This merge request marks the beginning of some deep peephole optimisations that seek to optimise code where a MOV reg,const (or MOVZ) instruction exists and its destination register is then used in an arithmetic operation, by checking whether the constant can be encoded directly in said instruction. A new helper function, is_arith_const, evaluates whether the constant can be encoded in an ADD, SUB, CMN or CMP instruction (the numbers 0 to 4095, and 4096 to 16773120 in steps of 4096).
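The encodable range described above can be sketched as a small check. This is an illustration in Python, not the compiler's actual code: only the name is_arith_const comes from this merge request, and the signature and body here are assumptions based on the stated range (a 12-bit immediate, optionally shifted left by 12 bits).

```python
def is_arith_const(value: int) -> bool:
    """Hypothetical re-creation of the range check described above.

    Accepts 0..4095 (unshifted 12-bit immediate) and multiples of 4096
    up to 4095 * 4096 = 16773120 (12-bit immediate shifted left by 12).
    """
    if value < 0:
        return False
    if value <= 0xFFF:
        # Fits directly in the 12-bit immediate field.
        return True
    # Otherwise the low 12 bits must be clear and the rest must fit in 12 bits.
    return (value & 0xFFF) == 0 and (value >> 12) <= 0xFFF


for candidate in (0, 32, 4095, 4096, 4097, 16773120, 16777216):
    print(candidate, is_arith_const(candidate))
```

Values such as 4097 are rejected because they would need both the shifted and the unshifted field at once.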

Other instructions, like logical instructions, will be added at a later date, as well as optimisations that deal with MOV reg1,reg2 instructions.

Additionally, a node-level optimisation has been implemented with min/max nodes that makes use of is_arith_const and attempts to directly create CMP reg,const instructions if the second operand is a constant, thus taking some strain off the peephole optimizer at low optimisation settings and potentially reducing the number of required iterations at higher optimisation settings.

System

  • Operating system: Linux (Raspberry Pi OS) and others
  • Processor architecture: AArch64
  • Device: Raspberry Pi and others

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Code is made smaller and faster in situations where a constant is moved into a register only to be used as an operand of an arithmetic or comparison instruction.

Relevant logs and/or screenshots

For a simple one to start off with: in the cgbase unit (aarch64-linux, -O4) - before:

.section .text.n_cgbase_$$_initmms$pmmshuffle$shortint,"ax"
	...
.Lc96:
	stp	x19,x20,[sp, #-16]!
	mov	x19,x0
	sxtb	w20,w1
	mov	w0,w20
	mov	w1,wzr
	cmp	w0,w1
	csel	w0,w0,w1,gt
	...

After - the CMP instruction gets changed from cmp w0,w1 to cmp w0,#0 since w1 is equal to zero (changed to wzr by the peephole optimizer elsewhere):

.section .text.n_cgbase_$$_initmms$pmmshuffle$shortint,"ax"
	...
.Lc96:
	stp	x19,x20,[sp, #-16]!
	mov	x19,x0
	sxtb	w20,w1
	mov	w0,w20
	mov	w1,wzr
	cmp	w0,#0
	csel	w0,w0,w1,gt
	...

Future optimisations will deal with the fact that w0 = w20 and w1 = wzr throughout much of the block.

In the regexpr unit - before:

.section .text.n_regexpr_$$__uppercase$ansichar$$ansichar,"ax"
	...
	movz	w0,#32
	sub	w20,w20,w0
	uxtb	w20,w20
	b	.Lj64
.Lj66:
	...

After - the number 32 can be encoded in SUB instructions, and since w0 is deallocated after this instruction, the original MOVZ gets removed as well:

.section .text.n_regexpr_$$__uppercase$ansichar$$ansichar,"ax"
	...
	sub	w20,w20,#32
	uxtb	w20,w20
	b	.Lj64
.Lj66:
	...
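The MOVZ/SUB fold above can be modelled on a toy instruction representation. This is a minimal sketch in Python, assuming instructions are plain tuples and that register liveness is supplied by the caller. The names fold_mov_const and encodable are hypothetical and do not reflect the compiler's internal structures.

```python
ARITH_OPS = {"add", "sub", "cmp", "cmn"}


def encodable(value):
    # Same range as described in the summary: 0..4095, or a multiple of
    # 4096 up to 16773120.
    if 0 <= value <= 0xFFF:
        return True
    return value > 0 and (value & 0xFFF) == 0 and (value >> 12) <= 0xFFF


def fold_mov_const(mov, op, reg_dead_after_op):
    """Try to fold the constant of `mov` into `op`.

    mov: ("movz", dst, const)
    op:  (mnemonic, operand, ..., last_operand) with register names as strings
    reg_dead_after_op: True if dst is deallocated after `op`
    Returns the list of instructions that replaces the pair.
    """
    _, dst, const = mov
    mnemonic, *operands = op
    foldable = (
        mnemonic in ARITH_OPS
        and operands[-1] == dst
        and dst not in operands[:-1]  # conservative: dst used only as the last operand
        and encodable(const)
    )
    if not foldable:
        return [mov, op]
    new_op = (mnemonic, *operands[:-1], f"#{const}")
    # The MOVZ can only be dropped if nothing else reads dst afterwards.
    return [new_op] if reg_dead_after_op else [mov, new_op]


print(fold_mov_const(("movz", "w0", 32), ("sub", "w20", "w20", "w0"), True))
```

When the register is still live after the operation, the sketch keeps the MOVZ and only rewrites the operand, which matches the cgbase example where mov w1,wzr remains in place.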

In the uhpackimp unit - multiple supporting peephole optimizations come into play - before:

.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
	...
.Lj606:
	str	wzr,[x19, #36]
	mov	w1,wzr
	add	w0,w0,w1
	str	w0,[x19, #32]
	...

Intermediate - since wzr is treated like 0 in the new optimisation, the value replaces w1 in the ADD instruction and then the original MOV gets removed since w1 is deallocated:

.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
	...
.Lj606:
	str	wzr,[x19, #36]
	add	w0,w0,wzr
	str	w0,[x19, #32]
	...

After - an additional new peephole optimisation removes add reg,reg,#0/wzr and sub reg,reg,#0/wzr instructions since they are identity operations, and then the two STR instructions are merged into an STP instruction since they write to contiguous memory:

.section .text.n_uhpackimp$_$thpackdynamictable_$__$$_setcapacity$longint,"ax"
	...
.Lj606:
	stp	w0,wzr,[x19, #32]
	...
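The two follow-up steps in the uhpackimp example can be sketched the same way. This is an illustrative Python model under simplifying assumptions: instructions are tuples, only 32-bit stores with an immediate offset are considered, and the caller has already established that the stores may be reordered. The names is_identity_arith and pair_word_stores are hypothetical.

```python
ZERO_OPERANDS = {"#0", "wzr", "xzr"}


def is_identity_arith(instr):
    """True for add/sub reg,reg,#0 (or the zero register), which change nothing."""
    mnemonic, *operands = instr
    return (
        mnemonic in ("add", "sub")
        and len(operands) == 3
        and operands[0] == operands[1]
        and operands[2] in ZERO_OPERANDS
    )


def pair_word_stores(first, second):
    """Merge two ("str", reg, base, offset) word stores into an stp tuple, or None."""
    (_, reg_a, base_a, off_a), (_, reg_b, base_b, off_b) = first, second
    if base_a != base_b or abs(off_a - off_b) != 4:
        return None  # not adjacent 32-bit slots off the same base
    low_off = min(off_a, off_b)
    # 32-bit STP takes a signed offset that is a multiple of 4 in -256..252.
    if low_off % 4 != 0 or not -256 <= low_off <= 252:
        return None
    low_reg, high_reg = (reg_a, reg_b) if off_a < off_b else (reg_b, reg_a)
    return ("stp", low_reg, high_reg, base_a, low_off)


block = [
    ("str", "wzr", "x19", 36),
    ("add", "w0", "w0", "wzr"),
    ("str", "w0", "x19", 32),
]
remaining = [instr for instr in block if not is_identity_arith(instr)]
print(pair_word_stores(*remaining))
```

The register stored at the lower address comes first in the pair, which is why the result corresponds to stp w0,wzr,[x19, #32] even though the store of wzr appeared first in the original code.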
