[Cross-platform] Fixed inefficiency in register allocator
Summary
This merge request fixes an inefficiency in the register allocator that sometimes causes a real register to become unusable immediately prior to its allocation. It most often appears on x86 when %ecx is reserved for use in a shift or rotate instruction (since it is the only register allowed to hold the variable shift count), and it means that in almost all cases another register is set up to contain the desired count and is then copied into %ecx, rather than %ecx being configured directly.
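As a point of reference, a shift by a variable count is enough to trigger this; a minimal illustration (not taken from the affected units, and the name is invented) is:

function ShiftLeft(Value, Count: LongWord): LongWord;
begin
  { Count is not a constant, so on x86 it must end up in %cl before the shll }
  ShiftLeft := Value shl Count;
end;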
Additionally, some conditions were rearranged to provide a performance micro-boost: assigned(p) is now checked only once, since it always returns True on subsequent iterations of the while loop, and whether a tai object is of type tai_regalloc is checked first, since that is by far the most likely case on a regular build.
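Schematically, the rearrangement looks roughly like this (a sketch rather than the actual allocator source; IsOtherHint is a hypothetical stand-in for the remaining conditions):

{ Before: assigned(p) is re-evaluated on every iteration, and the tai_regalloc
  type test - by far the most likely condition to hold on a regular build -
  comes last. }
while assigned(p) and (IsOtherHint(p) or (p.typ = ait_regalloc)) do
  ...

{ After: assigned(p) is tested once up front, and the tai_regalloc test is
  evaluated first so short-circuit evaluation rarely needs to look at the rest. }
if assigned(p) then
  while (p.typ = ait_regalloc) or IsOtherHint(p) do
    ...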
System
- Processor architecture: Cross-platform, but predominantly affects x86_64
What is the current bug behavior?
The register allocator is not as efficient as it could be when real registers are reserved and resized (e.g. around shift and rotate instructions).
What is the behavior after applying this patch?
This allocation inefficiency has been resolved.
Relevant logs and/or screenshots
Under x86_64-win64, -O4 in chmbase - before:
.section .text.n_chmbase_$$_writecompressedinteger$pointer$longword$$longword,"ax"
.balign 16,0x90
.globl CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD:
.Lc11:
.seh_proc CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
pushq %rbx
.seh_pushreg %rbx
.Lc12:
pushq %rsi
.seh_pushreg %rsi
.Lc13:
leaq -40(%rsp),%rsp
...
.Lj23:
movl $127,%r11d
movl %r9d,%ecx
shll %cl,%r11d
movslq %r11d,%rbx
...
.Lj29:
movl %edx,%r11d
movl %r9d,%ecx
shrl %cl,%r11d
andl $127,%r11d
movb %r11b,(%r8)
...
After - the procedure can now be 'coloured' with one fewer register, meaning %rsi no longer has to be pushed and popped:
.section .text.n_chmbase_$$_writecompressedinteger$pointer$longword$$longword,"ax"
.balign 16,0x90
.globl CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD:
.Lc11:
.seh_proc CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
pushq %rbx
.seh_pushreg %rbx
.Lc12:
leaq -48(%rsp),%rsp
...
.Lj23:
movl %r9d,%ecx
movl $127,%r10d
shll %cl,%r10d
movslq %r10d,%r11
...
.Lj29:
movl %r9d,%ecx
movl %edx,%r10d
shrl %cl,%r10d
andl $127,%r10d
movb %r10b,(%r8)
...
In dbffile - before:
.section .text.n_dbf_dbffile$_$tdbffile_$__$$_isnullflagset$pointer$tdbffielddef$tnullfieldflag$$boolean,"ax"
.balign 16,0x90
.globl DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN:
.Lc192:
.seh_proc DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
pushq %rbx
.seh_pushreg %rbx
.Lc193:
.seh_endprologue
...
je .Lj505
jmp .Lj503
.balign 16,0x90
.Lj504:
...
.Lj506:
xorb %al,%al
jmp .Lj503
.p2align 4,,10
.p2align 3
.Lj508:
...
movq (%r8),%rcx
movl 116(%rcx),%r11d
andl $7,%r11d
movl $1,%ebx
movl %r11d,%ecx
shll %cl,%ebx
movzbl (%r9),%ecx
andl %ecx,%ebx
setneb %al
jmp .Lj503
.balign 16,0x90
.Lj505:
...
After - this one is more profound: the removal of pushq %rbx allows jumps to the end to be replaced with ret, and %ecx is now configured directly rather than the shift count being computed in %r11d and then copied into %ecx/%cl.
.section .text.n_dbf_dbffile$_$tdbffile_$__$$_isnullflagset$pointer$tdbffielddef$tnullfieldflag$$boolean,"ax"
.balign 16,0x90
.globl DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN:
.Lc192:
...
je .Lj505
ret
.balign 16,0x90
.Lj504:
...
.Lj506:
xorb %al,%al
ret
.p2align 4,,10
.p2align 3
.Lj508:
...
movq (%r8),%rcx
movl 116(%rcx),%ecx
andl $7,%ecx
movl $1,%r11d
shll %cl,%r11d
movzbl (%r9),%ecx
andl %ecx,%r11d
setneb %al
ret
.balign 16,0x90
.Lj505:
Additional Notes
The inefficiency was caused by the presence of a "resize" hint (a type of tai_regalloc object) that stopped deallocations from being placed before it; such deallocations are usually placed before the allocations of other registers. In the shift/rotate code generation, %ecx is allocated and then resized to %rcx, and during a backwards search the register allocator would hit the "resize" hint first and place the deallocation of another register after the allocation of %rcx. This ultimately means that register can never be 'coloured' as %rcx, even if the two don't otherwise overlap.
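To make the mechanism concrete, here is a minimal self-contained model of that backwards search (plain Free Pascal, not the compiler's own rgobj code; the hint kinds and the three-entry "stream" are invented for the demonstration):

program ResizeHintDemo;
{$mode objfpc}

type
  THintKind = (hkInstruction, hkAlloc, hkResize);

{ Walk backwards from Index over allocation hints (and, with the fix, resize
  hints too) to find where a deallocation should be inserted. }
function DeallocInsertionPoint(const Kinds: array of THintKind; Index: Integer;
  TreatResizeAsBarrier: Boolean): Integer;
begin
  Result := Index;
  while (Result > 0) and
        ((Kinds[Result - 1] = hkAlloc) or
         ((Kinds[Result - 1] = hkResize) and not TreatResizeAsBarrier)) do
    Dec(Result);
end;

const
  { index 0: allocate %ecx; index 1: resize %ecx -> %rcx;
    index 2: the shll instruction that consumes the count }
  Stream: array[0..2] of THintKind = (hkAlloc, hkResize, hkInstruction);

begin
  { Old behaviour: the search stops at the resize hint, so the deallocation of
    the temporary count register lands at index 2, after the allocation and
    resize of %ecx/%rcx - the live ranges overlap and the temporary can never
    be coloured as %rcx. }
  WriteLn('resize treated as barrier: insert at index ',
    DeallocInsertionPoint(Stream, 2, True));

  { Fixed behaviour: the resize hint is skipped as well, so the deallocation
    goes in front of the allocation of %ecx (index 0) and the ranges no longer
    overlap. }
  WriteLn('resize skipped:            insert at index ',
    DeallocInsertionPoint(Stream, 2, False));
end.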
This fix doesn't affect i386, because no "resize" hint is generated on that platform, although the rearrangement of conditions still provides a performance micro-boost.
The improved code isn't perfect. Because the full %rcx register is reserved under x86_64, the compiler sees fit to include additional andl %ecx,%ecx instructions (these were originally movl %ireg##,%ecx instructions, and %ireg## gets 'coloured' into %ecx) as part of its protections against incorrect merging and the loss of zero extensions. For example, in expr - before:
...
.Lj77:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rax
movl %eax,%ecx
shll %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
jmp .Lj69
.balign 16,0x90
.Lj78:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rax
movl %eax,%ecx
shrl %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
...
After:
...
.Lj77:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rcx
movslq %ecx,%rcx
andl %ecx,%ecx
shll %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
jmp .Lj69
.balign 16,0x90
.Lj78:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rcx
movslq %ecx,%rcx
andl %ecx,%ecx
shrl %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
...
In this case, all of the zero and sign extensions are wholly unnecessary, and the movslq %ecx,%rcx; andl %ecx,%ecx pairs can be removed completely. This could be done as a peephole optimisation, but it is addressed more fundamentally in !1130 (merged), which reserves only %cl rather than the whole of %rcx, since byte-sized registers are available for all integer registers under x86_64 (existing peephole optimisations that handle mov; and pairs rather than movx; and pairs should be able to handle the rest if need be).