[Cross-platform] Fixed inefficiency in register allocator
## Summary
This merge request fixes an inefficiency in the register allocator that sometimes causes a real register to become unusable immediately prior to its allocation. This most often appears on x86 when `%ecx` is reserved for use in a shift or rotate instruction (since it's the only register that can hold a variable shift count), and means that in almost all cases another register is configured to contain the desired count and is then copied into `%ecx`, rather than `%ecx` being configured directly.
Additionally, some conditions were rearranged to provide a performance micro-boost: `assigned(p)` is only checked once, because it always returns True on subsequent iterations of the `while` loop, and whether a `tai` object is of type `tai_regalloc` is checked first, since this is far more likely to be the case in a regular build.
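To make the intent concrete, here is a minimal, hypothetical sketch (the `taitype` enumeration and the two walk routines are simplified stand-ins, not the actual rgobj code): with short-circuit boolean evaluation, hoisting the `assigned(p)` test and putting the most likely `ait_regalloc` check first reduces the work done per iteration.

```pascal
program CondOrderSketch;
{$mode objfpc}

type
  { simplified stand-ins for the compiler's tai list entries }
  taitype = (ait_instruction, ait_regalloc, ait_comment);

  tai = class
    typ: taitype;
    next: tai;
  end;

procedure WalkHintsBefore(p: tai);
begin
  { assigned(p) is re-evaluated on every iteration, and the rarer case
    (ait_comment here) is tested before the common ait_regalloc case }
  while assigned(p) and ((p.typ = ait_comment) or (p.typ = ait_regalloc)) do
    p := p.next;
end;

procedure WalkHintsAfter(p: tai);
begin
  { assigned(p) is tested once up front; as in the real allocator, the walk
    is assumed to always reach a non-regalloc entry before the list runs
    out, so the test would stay True on every later iteration anyway.  The
    far more likely ait_regalloc case is also tested first. }
  if assigned(p) then
    while (p.typ = ait_regalloc) or (p.typ = ait_comment) do
      p := p.next;
end;

begin
end.
```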
## System
- Processor architecture: Cross-platform, but predominantly affects x86_64
## What is the current bug behavior?
The register allocator is not as efficient as it could be when real registers are reserved and resized (e.g. around shift and rotate instructions).
## What is the behavior after applying this patch?
This allocation inefficiency has been resolved.
## Relevant logs and/or screenshots
Under `x86_64-win64`, `-O4`, in `chmbase` - before:

```
.section .text.n_chmbase_$$_writecompressedinteger$pointer$longword$$longword,"ax"
.balign 16,0x90
.globl CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD:
.Lc11:
.seh_proc CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
pushq %rbx
.seh_pushreg %rbx
.Lc12:
pushq %rsi
.seh_pushreg %rsi
.Lc13:
leaq -40(%rsp),%rsp
...
.Lj23:
movl $127,%r11d
movl %r9d,%ecx
shll %cl,%r11d
movslq %r11d,%rbx
...
.Lj29:
movl %edx,%r11d
movl %r9d,%ecx
shrl %cl,%r11d
andl $127,%r11d
movb %r11b,(%r8)
...
```

After, the procedure can be 'coloured' with one fewer register, meaning `%rsi` doesn't have to be pushed and popped:

```
.section .text.n_chmbase_$$_writecompressedinteger$pointer$longword$$longword,"ax"
.balign 16,0x90
.globl CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD:
.Lc11:
.seh_proc CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
pushq %rbx
.seh_pushreg %rbx
.Lc12:
leaq -48(%rsp),%rsp
...
.Lj23:
movl %r9d,%ecx
movl $127,%r10d
shll %cl,%r10d
movslq %r10d,%r11
...
.Lj29:
movl %r9d,%ecx
movl %edx,%r10d
shrl %cl,%r10d
andl $127,%r10d
movb %r10b,(%r8)
...
```

In `dbffile` - before:

```
.section .text.n_dbf_dbffile$_$tdbffile_$__$$_isnullflagset$pointer$tdbffielddef$tnullfieldflag$$boolean,"ax"
.balign 16,0x90
.globl DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN:
.Lc192:
.seh_proc DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
pushq %rbx
.seh_pushreg %rbx
.Lc193:
.seh_endprologue
...
je .Lj505
jmp .Lj503
.balign 16,0x90
.Lj504:
...
.Lj506:
xorb %al,%al
jmp .Lj503
.p2align 4,,10
.p2align 3
.Lj508:
...
movq (%r8),%rcx
movl 116(%rcx),%r11d
andl $7,%r11d
movl $1,%ebx
movl %r11d,%ecx
shll %cl,%ebx
movzbl (%r9),%ecx
andl %ecx,%ebx
setneb %al
jmp .Lj503
.balign 16,0x90
.Lj505:
...
```

After - this one is more profound, because the removal of `pushq %rbx` allows jumps to the end to be replaced with `ret`, and `%ecx` is now configured directly rather than `%r11d` being used and then copied into `%ecx`/`%cl`:

```
.section .text.n_dbf_dbffile$_$tdbffile_$__$$_isnullflagset$pointer$tdbffielddef$tnullfieldflag$$boolean,"ax"
.balign 16,0x90
.globl DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN:
.Lc192:
...
je .Lj505
ret
.balign 16,0x90
.Lj504:
...
.Lj506:
xorb %al,%al
ret
.p2align 4,,10
.p2align 3
.Lj508:
...
movq (%r8),%rcx
movl 116(%rcx),%ecx
andl $7,%ecx
movl $1,%r11d
shll %cl,%r11d
movzbl (%r9),%ecx
andl %ecx,%r11d
setneb %al
ret
.balign 16,0x90
.Lj505:
```

## Additional Notes
The inefficiency was caused by the presence of a "resize" hint (a type of `tai_regalloc` object) that stopped deallocations from being placed before it, even though deallocations are usually placed before the allocations of other registers. In the case of shift/rotate code generation, `%ecx` was allocated and then resized to `%rcx`, and during a backwards search the register allocator would hit the "resize" hint first and place the deallocation of another register after the allocation of `%rcx`, which ultimately meant that the other register could never be 'coloured' `%rcx` even if the two live ranges didn't otherwise overlap.
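For illustration, here is a hedged sketch of that backwards search with simplified stand-in types and a made-up helper name (`FindDeallocInsertionPoint`); it is not the real allocator code, but it shows the essence of the fix - stepping over "resize" hints instead of stopping at them:

```pascal
program ResizeHintSketch;
{$mode objfpc}

type
  TRegAllocType = (ra_alloc, ra_dealloc, ra_resize);

  { simplified stand-ins for the tai / tai_regalloc hierarchy }
  tai = class
    previous: tai;
  end;

  tai_regalloc = class(tai)
    ratype: TRegAllocType;
  end;

{ Hypothetical helper: walk backwards over register hints to find the point
  at which a deallocation can be inserted, so that it ends up before the
  allocations of other registers.  Before the fix, ra_resize was effectively
  not part of this set, so the walk stopped at the resize hint and the
  deallocation was placed after the allocation of %rcx, making the live
  ranges appear to overlap. }
function FindDeallocInsertionPoint(p: tai): tai;
begin
  while assigned(p) and (p is tai_regalloc) and
        (tai_regalloc(p).ratype in [ra_alloc, ra_resize]) do
    p := p.previous;
  Result := p;
end;

begin
end.
```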
This fix doesn't affect i386 because no "resize" hint is generated on this platform, although the rearrangement of conditions will still provide a performance micro-boost.
The improved code isn't perfect. Because the full `%rcx` register is reserved under x86_64, the compiler sees fit to include additional `andl %ecx,%ecx` instructions (these were originally `movl %ireg##,%ecx` instructions, where `%ireg##` gets 'coloured' into `%ecx`) as part of its protections against incorrect merging and the loss of zero extensions. For example, in `expr` - before:

```
...
.Lj77:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rax
movl %eax,%ecx
shll %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
jmp .Lj69
.balign 16,0x90
.Lj78:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rax
movl %eax,%ecx
shrl %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
...
```

After:

```
...
.Lj77:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rcx
movslq %ecx,%rcx
andl %ecx,%ecx
shll %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
jmp .Lj69
.balign 16,0x90
.Lj78:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rcx
movslq %ecx,%rcx
andl %ecx,%ecx
shrl %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
...
```

In this case, all of the zero and sign extensions are wholly unnecessary and the `movslq %ecx,%rcx; andl %ecx,%ecx` pairs can be removed completely. This could be done as a peephole optimisation, but is handled more fundamentally in !1130, which reserves only `%cl` rather than `%rcx`, since byte-sized registers are available for all general-purpose registers (existing peephole optimisations that handle `mov; and` pairs rather than `movx; and` pairs should be able to handle the rest if need be).
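Purely as an illustration of the peephole route that is mentioned but not taken here, the sketch below uses a made-up, text-based instruction list (nothing like the compiler's real taicpu representation) and simply assumes that a following shift by `%cl` is the only consumer of the value - a condition a real optimiser would have to prove through liveness tracking:

```pascal
program RedundantExtSketch;
{$mode objfpc}

type
  { grossly simplified stand-in for an assembler instruction entry }
  TInsn = record
    text: string;        { e.g. 'movslq %ecx,%rcx' }
  end;
  TInsnList = array of TInsn;

{ Hypothetical peephole: drop a 'movslq %ecx,%rcx' / 'andl %ecx,%ecx' pair
  when the next instruction is a shift by %cl (assumed to be the only use
  of the value); everything else is copied through unchanged. }
function StripRedundantPairs(const code: TInsnList): TInsnList;
var
  i, n: Integer;
begin
  SetLength(Result, Length(code));
  n := 0;
  i := 0;
  while i < Length(code) do
  begin
    if (i + 2 < Length(code)) and
       (code[i].text     = 'movslq %ecx,%rcx') and
       (code[i + 1].text = 'andl %ecx,%ecx') and
       (Pos('%cl,', code[i + 2].text) > 0) then
      Inc(i, 2)                       { skip the redundant pair }
    else
    begin
      Result[n] := code[i];
      Inc(n);
      Inc(i);
    end;
  end;
  SetLength(Result, n);               { trim to the surviving instructions }
end;

begin
end.
```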