[x86] "x and ((1 shl y) - 1)" now uses BZHI (!315)

J. Gareth "Kit" Moreton requested to merge CuriousKit/optimisations:bzhi-optimisation into main Nov 03, 2022

Summary

This merge request takes advantage of BMI2 (if it's enabled) to convert bitmasking sequences of the form x and ((1 shl y) - 1) (where y is a variable) into BZHI instructions at the node level.
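
As an illustration of why the substitution is possible, here is a minimal C sketch that models 32-bit BZHI in software and compares it with the mask expression. The function names are hypothetical and are not taken from the compiler source. The model follows the documented instruction behaviour: only the low 8 bits of the index are read, and an index at or above the operand size leaves the source unchanged. The comparison is deliberately limited to y in 0..31, since the C shift is undefined outside that range.

```c
#include <stdint.h>

/* Software model of 32-bit BZHI (hypothetical name, for illustration only).
   Only the low 8 bits of the index are used; if that value is 32 or more,
   the source is returned unchanged. */
static uint32_t bzhi32_model(uint32_t src, uint32_t index)
{
    uint32_t n = index & 0xFFu;
    if (n >= 32u)
        return src;
    return src & ((UINT32_C(1) << n) - 1u);
}

/* The source-level pattern, x and ((1 shl y) - 1), written in C.
   Only meaningful for y in 0..31. */
static uint32_t mask_low_bits(uint32_t x, uint32_t y)
{
    return x & ((UINT32_C(1) << y) - 1u);
}

/* Returns 1 if the model and the pattern agree for every y in 0..31. */
static int masks_agree_below_32(uint32_t x)
{
    for (uint32_t y = 0; y < 32u; y++) {
        if (bzhi32_model(x, y) != mask_low_bits(x, y))
            return 0;
    }
    return 1;
}
```

For shift counts below the operand size, the two functions return the same value, which is the case the optimisation relies on.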

System

Processor architecture: i386, x86_64

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

When compiling with -CpCOREAVX2, expressions of the form x and ((1 shl y) - 1) should compile to shorter instruction sequences in some situations, reducing the cycle count.

Additional notes

The merge request is split up into four distinct commits:

The first introduces a new emit_reg_ref_reg routine, along with supporting methods, to permit the node-level generation of a BZHI instruction that has operands in this order.
The second converts an "and" node and a suitable subtree into a BZHI instruction.
Sometimes the node's first pass converts an "and" node into an "inline" node that calls in_and_assign_x_y. The third commit also looks for this form and attempts to create a BZHI instruction from it.
The fourth contains 13 new tests that aim to test each part of the code generator that creates BZHI instructions in all of its forms, including ones where the shift operand is undersized and oversized.

Because the new code shares space with the previous ANDN optimisation (!305 (merged)), the if-blocks were restructured so that both optimisations can share some of them. A couple of internal error number clashes were fixed at the same time.

Relevant logs and/or screenshots

Parts of hlcgobj get some different registers allocated, but the main improvement is as follows - before:

	...
.Lj644:
	...
	call	*280(%rax)
	leaq	(,%rax,8),%rdx
	movl	$1,%eax
	shlx	%rdx,%rax,%rax
	subq	$1,%rax
	andq	%rsi,%rax
	...

After:

	...
.Lj644:
	...
	call	*280(%rax)
	shlq	$3,%rax
	bzhi	%rax,%rsi,%rax
	...
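
One way to read the two hlcgobj listings is as a "keep the low n bytes" operation: the call result is multiplied by 8 (the leaq and shlq $3 instructions) and used as a bit count for masking %rsi. The C sketch below models that reading with a software version of 64-bit BZHI. The Pascal source expression is not shown in this merge request, so the function names and the interpretation are assumptions.

```c
#include <stdint.h>

/* Software model of 64-bit BZHI (hypothetical name, for illustration only).
   Only the low 8 bits of the index are used; if that value is 64 or more,
   the source is returned unchanged. */
static uint64_t bzhi64_model(uint64_t src, uint64_t index)
{
    uint64_t n = index & 0xFFu;
    if (n >= 64u)
        return src;
    return src & ((UINT64_C(1) << n) - 1u);
}

/* Hypothetical helper mirroring the "after" listing: the byte count is
   shifted left by 3 (multiplied by 8) to form the bit index, then a single
   BZHI-style mask is applied to the value. */
static uint64_t keep_low_bytes(uint64_t value, uint64_t nbytes)
{
    return bzhi64_model(value, nbytes << 3);
}
```

With this model, keep_low_bytes(0x1122334455667788, 2) yields 0x7788, which corresponds to masking with (1 shl 16) - 1.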

Like with hlcgobj, the fpreadtiff unit gets different registers allocated centred around this improvement - before:

	...
.Lj982:
	movzbl	-60(%rdx),%r8d
	movl	$1,%r9d
	shlx	%r8d,%r9d,%r8d
	subl	$1,%r8d
	andl	-52(%rdx),%r8d
	movw	%r8w,%ax
	...

After:

	...
.Lj982:
	movzbl	-60(%rdx),%r8d
	bzhi	%r8d,-52(%rdx),%ecx
	movw	%cx,%ax
	...

For the inline-node optimisation, outside of the tests that seek to trigger it, the jcphuff unit of the PasJPEG package gains an improvement - before:

	...
.Lj43:
	cmpb	$0,24(%rbx)
	jne	.Lj40
	movslq	%edi,%rax
	movl	$1,%edx
	shlx	%eax,%edx,%eax
	subl	$1,%eax
	andl	%eax,%esi
	addl	%edi,%r12d
	...

After:

	...
.Lj43:
	cmpb	$0,24(%rbx)
	jne	.Lj40
	bzhi	%edi,%esi,%esi
	addl	%edi,%r12d
	...

Edited Nov 03, 2022 by J. Gareth "Kit" Moreton
