[x86] Code generator now generates SARX under BMI2

Summary

This merge request makes the code generator emit SARX instructions instead of SAR instructions under BMI2 when the SarInt intrinsics are used.
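
As an illustration (the function name and compiler flags here are mine, not from the patch), a trivial use of the SarInt64 intrinsic picks up the new instruction when the target CPU supports BMI2, for example with -CpCOREAVX2:

{ Illustrative sketch only: compiled for a BMI2-capable target, the
  SarInt64 call below now becomes a single SARX rather than a MOV
  into CL followed by SAR. }
function ArithShiftRight(Value: Int64; Count: Byte): Int64;
begin
  Result := SarInt64(Value, Count);
end;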

As an additional commit, BEXTR and BZHI are now checked alongside SHLX, SHRX, SARX and RORX when it comes to register spilling, since these instructions all follow the same operand format.

System

  • Processor architecture: i386, x86_64

What is the current bug behavior?

Under BMI2, the compiler mixes the legacy SAR instruction, which requires the shift count in %cl, with SHLX and SHRX, costing extra register moves.

What is the behavior after applying this patch?

The code generator now favours SARX when compiling under BMI2.

Relevant logs and/or screenshots

Improvements in the code are limited, since the SarInt routines are not often used, but there are a few.

In the jwaWindows unit under x86_64-win64 with -O4, before:

.section .text.n_jwawindows_$$_int64shramod32$int64$longword$$int64,"ax"
	.balign 16,0x90
.globl	JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
	movq	%rcx,%rax
	movb	%dl,%cl
	sarq	%cl,%rax
.Lc12:
	ret

After:

.section .text.n_jwawindows_$$_int64shramod32$int64$longword$$int64,"ax"
	.balign 16,0x90
.globl	JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64
JWAWINDOWS_$$_INT64SHRAMOD32$INT64$LONGWORD$$INT64:
.Lc11:
	movzbl	%dl,%edx
	sarx	%rdx,%rcx,%rax
.Lc12:
	ret

It's not perfect though. In the calc_divconst_magic_signed routine in the cgutils unit, before:

	...
	movl	$64,%edx
	subq	%rax,%rdx
	movb	%dl,%cl
	movq	(%rdi),%rax
	shlx	%rdx,%rax,%rax
	sarq	%cl,%rax
	...

After:

	...
	movl	$64,%edx
	subq	%rax,%rdx
	movzbl	%dl,%eax
	movq	(%rdi),%rcx
	shlx	%rdx,%rcx,%rdx
	sarx	%rax,%rdx,%rax
	...

These instructions correspond to:

sign_corrective_shift := (SizeOf(aint) * 8) - N;
magic_m := SarInt64(magic_m shl sign_corrective_shift, sign_corrective_shift);

There is an inefficiency: the code generator is forced to zero-extend the selector register, since the shift parameter of the SarInt intrinsics is a Byte. The most efficient code would be:

	...
	movl	$64,%edx
	subq	%rax,%rdx
	shlx	%rdx,(%rdi),%rcx
	sarx	%rdx,%rcx,%rax
	...

This can potentially be caught in the x86 code generator, because SARX (along with SHLX and SHRX) implicitly performs a bitwise AND on the selector ($1F for the 32-bit version and $3F for the 64-bit version), so the explicit zero-extension is redundant.
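
For reference, a minimal sketch (my illustration, not compiler code) of how the hardware treats the count operand, which is why the explicit zero-extension changes nothing:

{ Illustrative sketch only: SARX ignores all but the low bits of its
  count register, equivalent to masking the count first ($3F for the
  64-bit form, $1F for the 32-bit form). }
function SarxCountSemantics(Value: Int64; Count: QWord): Int64;
begin
  Result := SarInt64(Value, Byte(Count and $3F));
end;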
