[Cross-platform / Prototype] Record field optimisation in loops (!948) · Merge requests · FPC / FPC / FPC Source

Summary

This merge request aims to lay the groundwork for optimising local record types as laid out in #39888. It aims to occasionally store individual fields inside temprefs, which are more likely to be stored in registers than on the stack, when a loop is entered so as to reduce memory latency.

System

Processor architecture: Cross-platform

What is the current bug behavior?

N/A

What is the behavior after applying this patch?

Private (local) record fields are more likely to get promoted to registers in loops.

Relevant logs and/or screenshots

An example where it works beautifully is inside TFPMedianCutQuantizer.QuickSort of the fpquantizer unit. Under x86_64-win64, -O4, an extra register is pushed to the stack and the value stored at 32(%rsp) is promoted to a register (%r10d in this case). This thus improves some comparison code in the internal loops - before:

	...
.Lj325:
	...
	movl	(%r8,%rcx),%ecx
	cmpl	32(%rsp),%ecx
	jae	.Lj324
	...
.Lj331:
	...
	movl	(%r8,%rcx),%ecx
	cmpl	32(%rsp),%ecx
	jb	.Lj330

After (note that %rcx is set to a different counter variable before each comparison):

	...
.Lj325:
	...
	cmpl	(%r8,%rcx),%r10d
	jbe	.Lj324
	...
.Lj331:
	...

	cmpl	(%r8,%rcx),%r10d
	ja	.Lj330

Additional Notes

Currently, the code is contained within twhiterepeatnode.simplify so it will execute for while and repeat..until loops, and for loops once they're converted just prior to code generation (this has the added advantage of ensuring the temps are contained within the if block that is also generated, so if its condition ia False, data is not transferred to and from temps). Originally the code was inside tloopnode.simplify and was called by tfornode.simplify via inherited, but this caused the temps to appear outside of the if block, and at this point in the compiler there aren't many options for optimising the entire node tree due to where ConvertForLoops is called.

The compiler tries to be selective in what record fields are promoted. Those that are only read once or twice and not written back tend not to be promoted because gains are marginal at best and such a value is likely going to be cached anyway. After that, it has an upper limit on how many records to promote, and then tries to cull candidates if there are too many. This is to help alleviate register pressure. Currently it prioritises those that are referenced more often (regardless of if the code in question is actually executed) and those that are written are also prioritised. This upper limit has been set experimentally, ideally based on the number of registers available on a CPU. Currently it is set to 3 for i386 and i8086, 15 for AArch64 and RiscV and 7 otherwise (which is the value for x86_64 and 32-bit Arm).

There is still a lot of work to be done to improve this and there may currently be some performance losses. Many of these cannot be fixed in this merge request and require improvements elsewhere.

In dirparse, a large record is promoted to temps, but even with the upper limit, one temp ends up getting stored on the stack, in which case we're just transferring a value from one part of the stack to the other, offering no saving:

.section .text.n_dirparse_$$_updatealignmentstr$shortstring$talignmentinfo$$boolean,"ax"
	...
	movl	572(%rsp),%eax
	movq	%rax,1400(%rsp)
	movl	576(%rsp),%ebp
	movl	580(%rsp),%r15d
	movl	584(%rsp),%r14d
	movl	588(%rsp),%r13d
	movl	592(%rsp),%r12d
	movl	596(%rsp),%edi
	...

This also showcases an apparent glitch with temprefs, because while the temp is only 32 bits in size, for some reason it is extended to 64 bits on the stack, leading to some inefficiencies (when the temp is written back to the record, everything is 32-bit):

	movl	1400(%rsp),%edx
	movl	%edx,572(%rsp)

Also in this situation, while there are improvements in some of the internal conditions, the fields are ultimately only written, never read, so these probably shouldn't be promoted to temps at all. As to why their initial value is being read in the first place, it's because the writes are inside if conditions, so may not be executed. For the code improvements in question, the following structures (for different fields) appear three times - before:

	...
.Lj23:
	...
	movl	%edi,576(%rsp)
	movl	%edi,%eax
	cmpl	%edi,580(%rsp)
	cmovgl	580(%rsp),%eax
	movl	%eax,580(%rsp)
	jmp	.Lj5
	...
.Lj26:
	...
	jne	.Lj29
	movl	%edi,580(%rsp)
	jmp	.Lj5
	.p2align 4,,10
	.p2align 3
.Lj29:
	...

After:

	...
.Lj23:
	...
	movl	%ebx,%ebp
	cmpl	%ebx,%r15d
	cmovngl	%ebx,%r15d
	...
.Lj26:
	...
	cmovel	%ebx,%r15d
	je	.Lj5

In jquant2, the register pressure and tempref glitch is more pronounced - only 3 out of 7 are stored in registers, and under -O3 it will drop to 2 out of 7 due to %ebp not being available:

.section .text.n_jquant2_$$_pass2_fs_dither$j_decompress_ptr$jsamparray$jsamparray$longint,"ax"
	...
	movl	32(%rsp),%ebp
	movl	36(%rsp),%r15d
	movl	40(%rsp),%r14d
	movl	52(%rsp),%eax
	movq	%rax,184(%rsp)
	movl	56(%rsp),%eax
	movq	%rax,200(%rsp)
	movl	60(%rsp),%eax
	movq	%rax,208(%rsp)
	movl	64(%rsp),%eax
	movq	%rax,216(%rsp)

The tempref glitch also causes problems when initialising the fields inside the loop:

.Lj317:
	xorl	%ebp,%ebp
	xorl	%r15d,%r15d
	xorl	%r14d,%r14d
	movl	$0,44(%rsp)
	movl	$0,48(%rsp)
	xorl	%eax,%eax
	movq	%rax,184(%rsp)
	movq	%rax,200(%rsp)
	xorl	%eax,%eax
	movq	%rax,208(%rsp)
	movq	%rax,216(%rsp)
	movl	88(%rsp),%eax
	movq	%rax,176(%rsp)
	movl	44(%rsp),%esi
	movl	48(%rsp),%ebx

While those inside registers get clean xorl calls and those not promoted at all get movl $0,44(%rsp), for example, those temprefs that are on the stack get fairly inefficient calls:

	xorl	%eax,%eax
	movq	%rax,184(%rsp)
	movq	%rax,200(%rsp)
	xorl	%eax,%eax
	movq	%rax,208(%rsp)
	movq	%rax,216(%rsp)

This is, however, mitigated somewhat by !946 (merged).

The other problem in this case is the fact that the temps are written back to the record at the loop's conclusion:

	movl	216(%rsp),%eax
	movl	%eax,64(%rsp)
	movl	208(%rsp),%eax
	movl	%eax,60(%rsp)
	movl	200(%rsp),%eax
	movl	%eax,56(%rsp)
	movl	184(%rsp),%eax
	movl	%eax,52(%rsp)
	movl	%r14d,40(%rsp)
	movl	%r15d,36(%rsp)
	movl	%ebp,32(%rsp)

This is good and all, but then the function epilogue is reached, so all these writebacks are dead-stores and ultimately pointless.

Ultimately, for this to truly shine, some things need to be researched.

Find ways to detect when variables are not being used: dead-stores and points where variables are not in use so their allocated register or their space on the stack can be reused.
Fix the tempref zero extension glitch.
See if there are ways to determine if a tempref might be stored on the stack instead of a register before register allocation takes place, so as to avoid 'promotion' to another spot on the stack.

[Cross-platform / Prototype] Record field optimisation in loops