Skip to content

x86: New TEST optimisations

Summary

This merge request applies some optimisations to TEST/JNE pairs. Most notably, it removes superfluous TESTs if the flags haven't changed and the comparison is identical (similar to the optimisation that removes superfluous CMP instructions), but also merges tests with individual bits - notably, if a sequence such as the following shows up:

	testb	$2,144(%rax)
	jne	.Lj1691
	testb	$32,144(%rax)
	jne	.Lj1691

This can be compressed by merging the tested bits together with bitwise OR; in this case, producing:

	testb	$34,144(%rax)
	jne	.Lj1691

Note that this optimisation is not valid if "je" is used instead of "jne", but there are other combinations that are.

Besides reducing instruction count, it should provide a marginal speed increase because the number of conditional branches are reduced.

Additionally, "testb $255,###", "testw $65535,###" and "testl $4294967295,###" (all bits set for the given instruction size) get their constants changed to -1 to improve optimisation potential.

System

  • Processor architecture: i386, x86_64, possibly i8086

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

A large number of optimisations are performed in the compiler source and a few in the RTL. Similarly complex projects should also benefit.

Relevant logs and/or screenshots

Some simple examples: aasmcpu in the compiler - before:

.Lj538:
	testb	$32,32(%rsp)
	jne	.Lj544
	testb	$64,32(%rsp)
	jne	.Lj544
	testb	$128,32(%rsp)
	je	.Lj547

After:

.Lj538:
	testb	$96,32(%rsp)
	jne	.Lj544
	testb	$128,32(%rsp)
	je	.Lj547

In ncal, three TEST/JNE pairs get collapsed into a single one - before:

.Lj1006:
	movl	308(%rbx),%eax
	testl	$64,%eax
	jne	.Lj969
	testl	$8,%eax
	jne	.Lj969
	testl	$128,%eax
	jne	.Lj969

After:

.Lj1006:
	movl	308(%rbx),%eax
	testl	$200,%eax
	jne	.Lj969

In the System unit, a few superfluous TESTs get removed - before:

.Lj259:
	movzwl	(%rcx),%r11d
	movzwl	(%rdx),%ebx
	subq	%rbx,%r11
	movq	%r11,%r10
	testq	%r11,%r11
	je	.Lj263
	testq	%r11,%r11
	jnl	.Lj265

After:

.Lj259:
	movzwl	(%rcx),%r11d
	movzwl	(%rdx),%ebx
	subq	%rbx,%r11
	movq	%r11,%r10
	testq	%r11,%r11
	je	.Lj263
	jnl	.Lj265

In unicodedata, a duplicate TEST/JLE pair gets removed - before:

.section .text.n_unicodedata_$$_indexinarraydword$array_of_longword$longword$$int64,"ax"
	.balign 16,0x90
.globl	UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64
UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64:
	movq	$-1,%rax
	addq	$1,%rdx
	testl	%edx,%edx
	jle	.Lj364
	testl	%edx,%edx
	jle	.Lj364
        ...

After:

.section .text.n_unicodedata_$$_indexinarraydword$array_of_longword$longword$$int64,"ax"
	.balign 16,0x90
.globl	UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64
UNICODEDATA_$$_INDEXINARRAYDWORD$array_of_LONGWORD$LONGWORD$$INT64:
	movq	$-1,%rax
	addq	$1,%rdx
	testl	%edx,%edx
	jle	.Lj364
        ...

Additional Notes

The TEST instructions are allowed to have a reference as the second operand because the registers that compose it are guarateed not to change in between. There might be some side effects if another thread modifies the same memory location, but if anything, side effects are reduced because the number of TEST calls are reduced, plus it's on the programmer to properly synchronise such multithreading.

Despite the savings in the compiler assembly, the x86_64-win64 compiler grows by 512 bytes due to code that performs jump optimisations early when it already has the information it needs, this providing a speed boost for the compiler.

Ultimately, this optimisation benefits the compiler itself more than the RTL, probably due to the large number of sets that are used as states, for example.

Edited by J. Gareth "Kit" Moreton

Merge request reports