[x86] Bit test peephole optimizations (!322) · Merge requests · FPC / FPC / FPC Source

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:bit-test-peephole into main Nov 08, 2022

Summary

This merge request aims to convert TEST, OR and AND instructions that have already been optimised into smaller BT, BTS and BTR instructions, specifically:

TEST x,%reg/ref; JNE .Lbl, where x is a power of 2, becomes BT lb(x),%reg/ref; JC .Lbl, where lb(x) is the base-2 logarithm of x. Other conditional instructions work so long as the conditions are either E/Z or NE/NZ, which are converted into NC and C respectively.
OR x,%reg/ref, where x is a power of 2, becomes BTS lb(x),%reg/ref, although not if a TEST instruction immediately follows.
AND x,%reg/ref, where ¬x is a power of 2 (¬ = bitwise NOT), beomes BTR lb(¬x),%reg/ref, although not if a TEST instruction immediately follows.

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

TEST, OR and AND instructions that utilise constants will now sometimes be converted into BT, BTS and BTR instructions respectively if the resultant machine code is smaller.

Additional notes

The bit-test instructions have been present since the 30386 processor, but it wasn't until the Intel Pentium II processor that they have been just as fast as the standard logical instructions. Therefore, the optimisations are only performed under -Os or if the optimize CPU target is Pentium II or higher (under x86_64, this is always the case). Unfortunately, they are known to be slower on AMD processors.
In accordance with making instructions as small as possible, only instructions of size S_L and S_Q are modified (S_W shows no size difference, and may actually increase if the register in use is %eax).
Optimisations are done at the post peephole stage to ensure the instructions have already been simplified as far as possible and jump chains shortened.
64-bit immediates via mov; and etc. are not yet tested and optimised. Possible future expansion.
With 32-bit OR and AND instructions, if an immediate can fit into a signed 8-bit integer, its machine code takes fewer bytes as the immediate can be stored in 8 bits rather than 32, so these instances are not optimised. TEST instructions don't have this luxury.

Relevant logs and/or screenshots

A huge number of units receive changes. In aasmbase (-O4, x86_64-win64), before:

.section .text.n_aasmbase_$$_create_smartlink_library$$boolean,"ax"
	.balign 16,0x90
.globl	AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN
AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN:
.Lc5:
	testl	$8192,U_$GLOBALS_$$_CURRENT_SETTINGS+72(%rip)
	je	.Lj12
	...

After (instruction is 2 bytes smaller):

.section .text.n_aasmbase_$$_create_smartlink_library$$boolean,"ax"
	.balign 16,0x90
.globl	AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN
AASMBASE_$$_CREATE_SMARTLINK_LIBRARY$$BOOLEAN:
.Lc5:
	btl	$13,U_$GLOBALS_$$_CURRENT_SETTINGS+72(%rip)
	jnc	.Lj12
	...

In the aasmcpu unit - before:

.Lj505:
	movl	(%rbx),%eax
	andl	$536870880,%eax
	orl	$8192,%eax
	movl	%eax,(%rbx)
	jmp	.Lj438
	...

After (instruction is 1 byte smaller in this case rather than 2 since or imm,%eax has a shorter byte encoding over other registers):

.Lj505:
	movl	(%rbx),%eax
	andl	$536870880,%eax
	btsl	$13,%eax
	movl	%eax,(%rbx)
	jmp	.Lj438
	...

In the advancedipc unit - before:

	...
.Lj242:
	movl	$2147483647,%ecx
	call	SYSTEM_$$_RANDOM$LONGINT$$LONGINT
	movl	%eax,%ebx
	call	SYSTEM_$$_GETPROCESSID$$QWORD
	xorl	%eax,%ebx
	andl	$2147483647,%ebx
	movl	%ebx,-20(%rbp)
	...