[ARM / AArch64] Minimisation of zero and sign extension instructions (UXTB etc.)

Summary

This merge request introduces a new peephole optimisation that attempts to remove unnecessary register extension operations (UXTB, UXTH, SXTB and SXTH) by analysing the instructions that follow to see if the extended bits have any bearing on the final result.

System

  • Processor architecture: ARM, AArch64
  • Device: Tested on Raspberry Pi 4 and Raspberry Pi 400

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

The number of unnecessary zero and sign extension operations is significantly reduced (at least under -O3).

Relevant logs and/or screenshots

Under ARMv7-A (arm-linux, -O4 -CpARMV7A -OpARMV7A, although these optimisations also take effect without specifying ARMv7-A), a large number of units receive improvements. To begin with a simple example, in the colortxt unit - before:

	...
.globl	COLORTXT$_$TCOLOREDTEXT_$__$$_GETTHECOLOR$$BYTE
	...
	ldrb	r4,[r0, #64]
	b	.Lj36
.Lj35:
	mov	r1,#1
	bl	VIEWS$_$TVIEW_$__$$_GETCOLOR$WORD$$WORD
	uxth	r0,r0
	and	r4,r0,#255
.Lj36:
	and	r0,r4,#255

After - due to the use of AND against the constant 255, everything above the least-significant 8 bits of r0 is masked out as it is written to r4:

	...
.globl	COLORTXT$_$TCOLOREDTEXT_$__$$_GETTHECOLOR$$BYTE
	...
	ldrb	r4,[r0, #64]
	b	.Lj36
.Lj35:
	mov	r1,#1
	bl	VIEWS$_$TVIEW_$__$$_GETCOLOR$WORD$$WORD
	and	r4,r0,#255
.Lj36:
	and	r0,r4,#255
	...
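
The redundancy can be checked with a small model of the two instructions. The Python sketch below is not part of the patch; `uxth`, `and_imm`, `with_uxth` and `without_uxth` are hypothetical helper names that only mimic the 32-bit register semantics of UXTH and AND with an immediate.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def uxth(value):
    """Model UXTH: keep the low 16 bits and clear bits 16 to 31."""
    return value & 0xFFFF


def and_imm(value, imm):
    """Model AND Rd,Rn,#imm on a 32-bit register."""
    return value & imm & MASK32


def with_uxth(r0):
    """Original sequence: uxth r0,r0 followed by and r4,r0,#255."""
    return and_imm(uxth(r0), 255)


def without_uxth(r0):
    """Optimised sequence: and r4,r0,#255 only."""
    return and_imm(r0, 255)


# Sample inputs with different contents in the upper bits.
SAMPLES = [0, 1, 0xFF, 0x100, 0xFFFF, 0x10000, 0x12345678, 0xFFFFFFFF]
assert all(with_uxth(v) == without_uxth(v) for v in SAMPLES)
```

Because the mask 255 has no bits set above bit 15, clearing bits 16 to 31 first cannot change what the AND writes to r4.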

In the aasmcpu unit - before:

	...
.Lj1873:
	...
	strb	r0,[r11, #-60]
	uxtb	r0,r0
	sub	r0,r0,#1
	strb	r0,[r11, #-60]
	...

After - r0 is deallocated after the second STRB instruction, so its upper 24 bits are never read, and they cannot affect the lower 8 bits of the subtraction:

	...
.Lj1873:
	...
	strb	r0,[r11, #-60]
	sub	r0,r0,#1
	strb	r0,[r11, #-60]
	...

A similar thing happens in the aasmtai unit, albeit with an OR operation - before:

	...
.Lj172:
	...
	strb	r0,[r4, #38]
	uxtb	r0,r0
	orr	r0,r0,#4
	strb	r0,[r4, #38]
	...

After (it's interesting to note that in both of these cases, the first STRB is a dead store; this might warrant further research):

	...
.Lj172:
	...
	strb	r0,[r4, #38]
	orr	r0,r0,#4
	strb	r0,[r4, #38]
	...
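
Both the SUB and the ORR case rely on the same property: the byte written by STRB depends only on the low 8 bits of the operand. The following Python sketch illustrates this under the assumption, taken from the description above, that r0 is not read again after the store. The helper names (`uxtb`, `strb`, `sub1`, `orr4`, `stored_byte`) are hypothetical and exist only for this illustration.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def uxtb(value):
    """Model UXTB: keep the low 8 bits and clear bits 8 to 31."""
    return value & 0xFF


def strb(value):
    """Model STRB: only the low byte of the register reaches memory."""
    return value & 0xFF


def sub1(value):
    """Model sub r0,r0,#1 with 32-bit wrap-around."""
    return (value - 1) & MASK32


def orr4(value):
    """Model orr r0,r0,#4."""
    return (value | 4) & MASK32


def stored_byte(r0, op, extend):
    """Byte written by the final STRB, with or without the UXTB."""
    if extend:
        r0 = uxtb(r0)
    return strb(op(r0))


# Every low byte, combined with a few arbitrary upper-bit patterns.
for upper in (0x00000000, 0x00000100, 0xABCDEF00, 0xFFFFFF00):
    for low in range(256):
        r0 = upper | low
        for op in (sub1, orr4):
            assert stored_byte(r0, op, True) == stored_byte(r0, op, False)
```

A borrow in the subtraction only propagates from lower bits to higher bits, which is why the upper 24 bits cannot influence the stored byte.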

In fpide, an example with ADD appears - before:

	...
.section .text.n_fpide_$$_inctargetedeventptr$smallint$$smallint,"ax"
	...
	sxth	r0,r0
	add	r0,r0,#1
	sxth	r0,r0
	...

After - thanks to the second SXTH and the fact that the upper 16 bits have no bearing on the lower 16 bits during the addition, the first SXTH can be removed:

	...
.section .text.n_fpide_$$_inctargetedeventptr$smallint$$smallint,"ax"
	...
	add	r0,r0,#1
	sxth	r0,r0
	...
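
As a rough check of this reasoning, the sketch below models both sequences in Python and compares them for every 16-bit input combined with a few upper-half patterns. It is an illustration only, not code from the patch; `sxth`, `add1`, `before` and `after` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def sxth(value):
    """Model SXTH: sign-extend the low 16 bits to 32 bits."""
    low = value & 0xFFFF
    if low & 0x8000:
        low |= 0xFFFF0000
    return low


def add1(value):
    """Model add r0,r0,#1 with 32-bit wrap-around."""
    return (value + 1) & MASK32


def before(r0):
    """Original sequence: sxth, add, sxth."""
    return sxth(add1(sxth(r0)))


def after(r0):
    """Optimised sequence: add, sxth."""
    return sxth(add1(r0))


# All 16-bit low halves, with three different upper halves.
for upper in (0x00000000, 0x00010000, 0xFFFF0000):
    for low in range(0x10000):
        assert before(upper | low) == after(upper | low)
```

The final SXTH overwrites bits 16 to 31 from bit 15 of the sum, and bit 15 of the sum depends only on the low 16 bits of the input.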

In fpwritepng, an example with UXTH and AND against 65280 ($FF00) appears (a large immediate that, for once, can be encoded directly in the instruction as an 8-bit value rotated by the barrel shifter) - before:

	...
.section .text.n_fpwritepng$_$tfpwriterpng_$__$$_colordatacolorb$tfpcolor$$qword,"ax"
	...
	mov	r0,r1,lsr #16
	uxth	r0,r0
	and	r0,r0,#65280
	add	r0,r0,r3
	...

After (the mask only keeps bits 8 to 15, so the zero extension has no effect on the result):

	...
.section .text.n_fpwritepng$_$tfpwriterpng_$__$$_colordatacolorb$tfpcolor$$qword,"ax"
	...
	mov	r0,r1,lsr #16
	and	r0,r0,#65280
	add	r0,r0,r3
	...
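
Two facts are in play in this example: 65280 can be encoded as an immediate, and the AND makes the UXTH redundant. The Python sketch below illustrates both. It assumes the classic ARM (A32) data-processing immediate format, an 8-bit value rotated right by an even amount, and does not model Thumb-2 encodings. `ror32`, `is_a32_immediate` and `uxth` are hypothetical helper names, not compiler routines.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def ror32(value, amount):
    """Rotate a 32-bit value right by 'amount' bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def is_a32_immediate(value):
    """True if 'value' equals an 8-bit constant rotated right by 0, 2, ..., 30."""
    value &= MASK32
    for rot in range(0, 32, 2):
        # Rotating right by (32 - rot) undoes a rotate-right by rot.
        if ror32(value, 32 - rot) <= 0xFF:
            return True
    return False


def uxth(value):
    """Model UXTH: keep the low 16 bits and clear bits 16 to 31."""
    return value & 0xFFFF


# 65280 = 0xFF00 is 0xFF rotated, so it fits in a single AND instruction.
assert is_a32_immediate(65280)
# 65535 = 0xFFFF needs 16 significant bits and cannot be encoded this way.
assert not is_a32_immediate(65535)

# The mask has no bits above bit 15, so the UXTH cannot change the result.
for r0 in (0, 0x1234, 0xFF00, 0x12345678, 0xFFFFFFFF):
    assert (uxth(r0) & 65280) == (r0 & 65280)
```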

In the pyacc unit, the removal of a SXTH instruction allows the combination of the two arithmetic operations on either side of it - before:

	...
.Lj733:
	add	r0,r4,#1
	sxth	r0,r0
	ldr	r1,.Lj712
	add	r0,r0,#1
	strh	r0,[r1]
	...

After (even if r4 contains information in the upper 16 bits, it has no bearing on the lower 16 bits of r0 that are stored via the STRH instruction):

	...
.Lj733:
	ldr	r1,.Lj712
	add	r0,r4,#2
	strh	r0,[r1]
	...
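
The equivalence of the two sequences, as far as the stored halfword is concerned, can be sketched in Python as follows. This is an illustration under the assumption that r0 is not read again after the STRH; the names `sxth`, `add`, `strh`, `before` and `after` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def sxth(value):
    """Model SXTH: sign-extend the low 16 bits to 32 bits."""
    low = value & 0xFFFF
    if low & 0x8000:
        low |= 0xFFFF0000
    return low


def add(value, imm):
    """Model add Rd,Rn,#imm with 32-bit wrap-around."""
    return (value + imm) & MASK32


def strh(value):
    """Model STRH: only the low halfword of the register reaches memory."""
    return value & 0xFFFF


def before(r4):
    """Original sequence: add #1, sxth, add #1, strh."""
    r0 = add(r4, 1)
    r0 = sxth(r0)
    r0 = add(r0, 1)
    return strh(r0)


def after(r4):
    """Optimised sequence: add #2, strh."""
    return strh(add(r4, 2))


# All 16-bit low halves, with three different upper halves in r4.
for upper in (0x00000000, 0xDEAD0000, 0xFFFF0000):
    for low in range(0x10000):
        assert before(upper | low) == after(upper | low)
```

Once the SXTH is gone, the two additions of 1 are adjacent operations on the same value and can be folded into a single addition of 2.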
