[Cross-platform / Prototype] Record field optimisation in loops
Summary
This merge request aims to lay the groundwork for optimising local record types as laid out in #39888. It aims to occasionally store individual fields inside temprefs, which are more likely to be stored in registers than on the stack, when a loop is entered so as to reduce memory latency.
System
- Processor architecture: Cross-platform
What is the current bug behavior?
N/A
What is the behavior after applying this patch?
Private (local) record fields are more likely to get promoted to registers in loops.
Relevant logs and/or screenshots
An example where it works beautifully is inside TFPMedianCutQuantizer.QuickSort of the fpquantizer unit. Under x86_64-win64, -O4, an extra register is pushed to the stack and the value stored at 32(%rsp) is promoted to a register (%r10d in this case). This thus improves some comparison code in the internal loops - before:
...
.Lj325:
...
movl (%r8,%rcx),%ecx
cmpl 32(%rsp),%ecx
jae .Lj324
...
.Lj331:
...
movl (%r8,%rcx),%ecx
cmpl 32(%rsp),%ecx
jb .Lj330
After (note that %rcx is set to a different counter variable before each comparison):
...
.Lj325:
...
cmpl (%r8,%rcx),%r10d
jbe .Lj324
...
.Lj331:
...
cmpl (%r8,%rcx),%r10d
ja .Lj330
Additional Notes
Currently, the code is contained within twhiterepeatnode.simplify so it will execute for while and repeat..until loops, and for loops once they're converted just prior to code generation (this has the added advantage of ensuring the temps are contained within the if block that is also generated, so if its condition ia False, data is not transferred to and from temps). Originally the code was inside tloopnode.simplify and was called by tfornode.simplify via inherited, but this caused the temps to appear outside of the if block, and at this point in the compiler there aren't many options for optimising the entire node tree due to where ConvertForLoops is called.
The compiler tries to be selective in what record fields are promoted. Those that are only read once or twice and not written back tend not to be promoted because gains are marginal at best and such a value is likely going to be cached anyway. After that, it has an upper limit on how many records to promote, and then tries to cull candidates if there are too many. This is to help alleviate register pressure. Currently it prioritises those that are referenced more often (regardless of if the code in question is actually executed) and those that are written are also prioritised. This upper limit has been set experimentally, ideally based on the number of registers available on a CPU. Currently it is set to 3 for i386 and i8086, 15 for AArch64 and RiscV and 7 otherwise (which is the value for x86_64 and 32-bit Arm).
There is still a lot of work to be done to improve this and there may currently be some performance losses. Many of these cannot be fixed in this merge request and require improvements elsewhere.
In dirparse, a large record is promoted to temps, but even with the upper limit, one temp ends up getting stored on the stack, in which case we're just transferring a value from one part of the stack to the other, offering no saving:
.section .text.n_dirparse_$$_updatealignmentstr$shortstring$talignmentinfo$$boolean,"ax"
...
movl 572(%rsp),%eax
movq %rax,1400(%rsp)
movl 576(%rsp),%ebp
movl 580(%rsp),%r15d
movl 584(%rsp),%r14d
movl 588(%rsp),%r13d
movl 592(%rsp),%r12d
movl 596(%rsp),%edi
...
This also showcases an apparent glitch with temprefs, because while the temp is only 32 bits in size, for some reason it is extended to 64 bits on the stack, leading to some inefficiencies (when the temp is written back to the record, everything is 32-bit):
movl 1400(%rsp),%edx
movl %edx,572(%rsp)
Also in this situation, while there are improvements in some of the internal conditions, the fields are ultimately only written, never read, so these probably shouldn't be promoted to temps at all. As to why their initial value is being read in the first place, it's because the writes are inside if conditions, so may not be executed. For the code improvements in question, the following structures (for different fields) appear three times - before:
...
.Lj23:
...
movl %edi,576(%rsp)
movl %edi,%eax
cmpl %edi,580(%rsp)
cmovgl 580(%rsp),%eax
movl %eax,580(%rsp)
jmp .Lj5
...
.Lj26:
...
jne .Lj29
movl %edi,580(%rsp)
jmp .Lj5
.p2align 4,,10
.p2align 3
.Lj29:
...
After:
...
.Lj23:
...
movl %ebx,%ebp
cmpl %ebx,%r15d
cmovngl %ebx,%r15d
...
.Lj26:
...
cmovel %ebx,%r15d
je .Lj5
In jquant2, the register pressure and tempref glitch is more pronounced - only 3 out of 7 are stored in registers, and under -O3 it will drop to 2 out of 7 due to %ebp not being available:
.section .text.n_jquant2_$$_pass2_fs_dither$j_decompress_ptr$jsamparray$jsamparray$longint,"ax"
...
movl 32(%rsp),%ebp
movl 36(%rsp),%r15d
movl 40(%rsp),%r14d
movl 52(%rsp),%eax
movq %rax,184(%rsp)
movl 56(%rsp),%eax
movq %rax,200(%rsp)
movl 60(%rsp),%eax
movq %rax,208(%rsp)
movl 64(%rsp),%eax
movq %rax,216(%rsp)
The tempref glitch also causes problems when initialising the fields inside the loop:
.Lj317:
xorl %ebp,%ebp
xorl %r15d,%r15d
xorl %r14d,%r14d
movl $0,44(%rsp)
movl $0,48(%rsp)
xorl %eax,%eax
movq %rax,184(%rsp)
movq %rax,200(%rsp)
xorl %eax,%eax
movq %rax,208(%rsp)
movq %rax,216(%rsp)
movl 88(%rsp),%eax
movq %rax,176(%rsp)
movl 44(%rsp),%esi
movl 48(%rsp),%ebx
While those inside registers get clean xorl calls and those not promoted at all get movl $0,44(%rsp), for example, those temprefs that are on the stack get fairly inefficient calls:
xorl %eax,%eax
movq %rax,184(%rsp)
movq %rax,200(%rsp)
xorl %eax,%eax
movq %rax,208(%rsp)
movq %rax,216(%rsp)
This is, however, mitigated somewhat by !946 (merged).
The other problem in this case is the fact that the temps are written back to the record at the loop's conclusion:
movl 216(%rsp),%eax
movl %eax,64(%rsp)
movl 208(%rsp),%eax
movl %eax,60(%rsp)
movl 200(%rsp),%eax
movl %eax,56(%rsp)
movl 184(%rsp),%eax
movl %eax,52(%rsp)
movl %r14d,40(%rsp)
movl %r15d,36(%rsp)
movl %ebp,32(%rsp)
This is good and all, but then the function epilogue is reached, so all these writebacks are dead-stores and ultimately pointless.
Ultimately, for this to truly shine, some things need to be researched.
- Find ways to detect when variables are not being used: dead-stores and points where variables are not in use so their allocated register or their space on the stack can be reused.
- Fix the tempref zero extension glitch.
- See if there are ways to determine if a tempref might be stored on the stack instead of a register before register allocation takes place, so as to avoid 'promotion' to another spot on the stack.