
[Cross-platform] Label and align stripping rework

Summary

This merge request reworks the label stripping performed during the peephole optimisation stage, because the previous approach caused inefficiencies elsewhere.

  • General stripping is no longer performed during Pass 1 and Pass 2, but is instead performed during the post-peephole optimisation stage after all other major optimisations have been performed.
  • The SkipInstr set now also includes ait_align, since alignment hints offer no useful information on their own for the peephole optimiser, and legitimate ones should always precede a live label (which is not a member of SkipInstr).
  • The stripping of alignment hints is now smarter: alignments that appear before dead labels are removed, and those before live labels are generally preserved, which improves execution performance. Non-jump labels are also handled better.
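
The align rule in the last bullet can be sketched as a single pass over a toy instruction list. This is only an illustration in Python, not the compiler's code: the `Item` type, its fields and `strip_labels_and_aligns` are hypothetical names, and the treatment of non-jump labels (label kept, preceding align dropped) is an assumption read off the db unit example further down.

```python
from dataclasses import dataclass

@dataclass
class Item:
    kind: str                # "instr", "align" or "label"
    text: str
    refs: int = 0            # jump references; only used for jump labels
    jump_label: bool = True  # False for non-jump labels such as .Lc1473

def strip_labels_and_aligns(items):
    """Drop dead jump labels and any align that does not precede a live one."""
    out, pending = [], []
    for it in items:
        if it.kind == "align":
            pending.append(it)       # decide once we see what follows
        elif it.kind == "label" and it.jump_label:
            if it.refs > 0:
                out.extend(pending)  # live label: keep its alignment
                pending = []
                out.append(it)
            # dead label: dropped; pending aligns wait for the next item
        else:
            # Instruction or non-jump label (assumed behaviour): the
            # pending aligns serve no jump target, so they are dropped.
            pending = []
            out.append(it)
    return out
```

Buffering the aligns until the next item is seen means an align followed by a dead label and then a live one is still kept for the live label.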

System

  • Processor architecture: Cross-platform (but affects x86 more than most)

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Overall execution speed is improved thanks to smarter preservation and removal of alignment hints.

Additional notes

Some optimisations under x86 (and likely AArch64 later) would benefit from being able to 'resurrect' a dead label for some jump optimisations. Previously this was not possible, because the label objects were released almost as soon as their reference counts fell to zero. Additionally, some inefficient alignment hints remained.
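
To illustrate why deferring the release matters, here is a minimal Python sketch of a label registry in which a reference count of zero only marks a label as dead, and the actual release happens in one final pass. `LabelTable` and its methods are hypothetical and do not mirror the compiler's data structures.

```python
class LabelTable:
    """Toy label registry with deferred release (hypothetical, not FPC's API)."""

    def __init__(self):
        self.refs = {}  # label name -> current reference count

    def add_ref(self, name):
        # Also serves to resurrect a dead label that has not been stripped yet.
        self.refs[name] = self.refs.get(name, 0) + 1

    def drop_ref(self, name):
        # Deferred release: a count of zero keeps the entry in the table.
        self.refs[name] -= 1

    def is_live(self, name):
        return self.refs.get(name, 0) > 0

    def strip_dead(self):
        """Final pass: release every label whose count is still zero."""
        dead = [name for name, count in self.refs.items() if count == 0]
        for name in dead:
            del self.refs[name]
        return dead
```

With this shape, an optimisation that runs between `drop_ref` and `strip_dead` can still call `add_ref` on a label whose count has reached zero, which an eager release would not allow.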

Relevant logs and/or screenshots

The examples below use x86_64-win64 with -O4, since this combination contains the most complex cross-jump optimisations and the like.

For a simple and common example - in the classes unit - before:

	...
.Lj1539:
	movq	%rbx,%rcx
	call	CLASSES$_$TPARSER_$__$$_HANDLENEWLINE
	jmp	.Lj1533
.Lj1535:
	nop
	leaq	32(%rsp),%rsp
	popq	%rbx
.Lc1433:
	ret

After - an align between JMP and the next label gets preserved:

	...
.Lj1539:
	movq	%rbx,%rcx
	call	CLASSES$_$TPARSER_$__$$_HANDLENEWLINE
	jmp	.Lj1533
	.balign 16,0x90
.Lj1535:
	nop
	leaq	32(%rsp),%rsp
	popq	%rbx
.Lc1433:
	ret

In the db unit - before:

	...
	movw	$1,%ax
	ret
	.balign 16,0x90
.Lj1870:
	movw	$2,%ax
	ret
	.balign 16,0x90
.Lj1867:
	xorw	%ax,%ax
	.balign 16,0x90
.Lc1473:
	ret

After - the align that appears before the final RET instruction is removed, thereby saving 14 bytes in this case:

	movw	$1,%ax
	ret
	.balign 16,0x90
.Lj1870:
	movw	$2,%ax
	ret
	.balign 16,0x90
.Lj1867:
	xorw	%ax,%ax
.Lc1473:
	ret
	...
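
The size saving from dropping a `.balign 16,0x90` depends on where the directive happens to fall. As a rough Python sketch of the arithmetic (the offsets used are hypothetical, since the listing does not show addresses):

```python
def balign_padding(offset, alignment=16):
    """Bytes of padding that a .balign directive emits at the given offset."""
    return -offset % alignment

# Hypothetical offsets within a section, for illustration only.
for offset in (0x10, 0x12, 0x1F):
    print(hex(offset), balign_padding(offset))
```

A saving of 14 bytes is what this formula gives whenever the directive sits 2 bytes past a 16-byte boundary, for example at the hypothetical offset 0x12.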

Also in the db unit - before:

	...
	testq	%rax,%rax
	jnl	.Lj133
	movl	$-1,%edx
	jmp	.Lj134
	.p2align 4,,10
	.p2align 3
.Lj133:
	xorl	%edx,%edx
	testq	%rax,%rax
	setgb	%dl
.Lj134:
	testl	%edx,%edx
	jne	.Lj139
	...

After:

	...
	testq	%rax,%rax
	jnl	.Lj133
	movl	$-1,%edx
	jmp	.Lj134
	.p2align 4,,10
	.p2align 3
.Lj133:
	xorl	%edx,%edx
	testq	%rax,%rax
	setgb	%dl
	.p2align 4,,10
	.p2align 3
.Lj134:
	testl	%edx,%edx
	jne	.Lj139
	...

This one is possibly more questionable, since .Lj134 has only one reference, which is on a nearby jump. However, there are cases elsewhere in which the equivalent label in similar sequences has more than one reference. Nevertheless, future research may find savings here. This sequence corresponds to the "sign" function for input %rax, and the outcome of the TEST instruction after .Lj134 is known in advance if control flow came from jmp .Lj134 above, because of the movl $-1,%edx before it. Optimising that jump to jmp .Lj139 would reduce the reference count of .Lj134 to zero.
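
On the cost of the added hint itself: in GNU as, `.p2align 4,,10` aligns to 16 bytes only if that takes at most 10 bytes of padding, and the following `.p2align 3` then aligns to 8 bytes unconditionally. The Python sketch below computes the padding for the pair; the function names and the offsets in the loop are hypothetical.

```python
def p2align_padding(offset, power, max_skip=None):
    """Padding for '.p2align power,,max_skip' at the given offset."""
    pad = -offset % (1 << power)
    if max_skip is not None and pad > max_skip:
        return 0  # the alignment is skipped entirely
    return pad

def paired_padding(offset):
    """Padding for '.p2align 4,,10' followed by '.p2align 3'."""
    first = p2align_padding(offset, 4, 10)
    second = p2align_padding(offset + first, 3)
    return first + second

# Hypothetical offsets, for illustration only.
for offset in (0x11, 0x15, 0x16, 0x20):
    print(hex(offset), paired_padding(offset))
```

So the pair never inserts more than 10 bytes, which bounds the size cost of aligning a label such as .Lj134.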

In the regexpr unit - before:

.section .text.n_regexpr_$$_isanylinebreak$ansichar$$boolean,"ax"
	.balign 16,0x90
.globl	REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN
REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN:
.Lc8:
	cmpb	$10,%cl
	jb	.Lj14
	subb	$12,%cl
	jbe	.Lj15
	subb	$1,%cl
	jne	.Lj14
	.balign 16,0x90
.Lj15:
	movb	$1,%al
	ret
	.balign 16,0x90
.Lj14:
	xorb	%al,%al
	.balign 16,0x90
.Lc9:
	ret

After - thanks to SkipInstr now including alignment hints, another peephole optimisation is able to optimise this procedure further and properly convert jne .Lj14 into a SET/RET pair. The align before the final RET is also removed, resulting in a smaller and faster function:

.section .text.n_regexpr_$$_isanylinebreak$ansichar$$boolean,"ax"
	.balign 16,0x90
.globl	REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN
REGEXPR_$$_ISANYLINEBREAK$ANSICHAR$$BOOLEAN:
.Lc8:
	cmpb	$10,%cl
	jb	.Lj14
	subb	$12,%cl
	jbe	.Lj15
	subb	$1,%cl
	seteb	%al
	ret
	.balign 16,0x90
.Lj15:
	movb	$1,%al
	ret
	.balign 16,0x90
.Lj14:
	xorb	%al,%al
.Lc9:
	ret
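
As a sanity check on the jne to SET/RET conversion, the two listings can be modelled in Python and compared for every possible value of %cl. This is a hand-written model of only the flag behaviour these instructions rely on, not compiler output, and the function names are made up for the sketch.

```python
def is_any_line_break_before(cl):
    if cl < 10:                # cmpb $10,%cl ; jb .Lj14
        return 0
    below_or_equal = cl <= 12  # CF or ZF after subb $12,%cl
    cl = (cl - 12) & 0xFF
    if below_or_equal:         # jbe .Lj15
        return 1
    cl = (cl - 1) & 0xFF       # subb $1,%cl
    if cl != 0:                # jne .Lj14
        return 0
    return 1                   # falls through to .Lj15

def is_any_line_break_after(cl):
    if cl < 10:                # cmpb $10,%cl ; jb .Lj14
        return 0
    below_or_equal = cl <= 12  # CF or ZF after subb $12,%cl
    cl = (cl - 12) & 0xFF
    if below_or_equal:         # jbe .Lj15
        return 1
    cl = (cl - 1) & 0xFF       # subb $1,%cl
    return 1 if cl == 0 else 0 # seteb %al ; ret

accepted = [c for c in range(256) if is_any_line_break_after(c)]
```

Both versions accept the same four byte values, 10 to 13, so the conversion preserves the result while removing a conditional jump.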

One notable inefficiency is in the zip unit - before:

	...
	movl	%eax,%esi
	testl	%eax,%eax
	jne	.Lj142
	...
	movl	%eax,%esi
	testl	%eax,%eax
	jne	.Lj142
	...

After:

	...
	movl	%eax,%esi
	testl	%eax,%eax
	jne	.Lj130
	...
	movl	%eax,%esi
.Lj130:
	testl	%esi,%esi
	jne	.Lj142
	...

This, however, is not the fault of the label stripping rework, but an inefficiency in the "Deep MOV" optimisation that is resolved in !517 (merged).

There are other inefficiencies, such as some TEST instructions no longer being converted to BT, but these are seen as an inefficiency or oversight in that particular peephole optimisation rather than a fault in the label stripping overhaul.

Edited by J. Gareth "Kit" Moreton
