[Cross-platform] Fixed inefficiency in register allocator
## Summary
This merge request fixes an inefficiency in the register allocator that sometimes causes a real register to become unusable immediately prior to its allocation. This most often appears on x86 when `%ecx` is reserved for use in a shift or rotate instruction (since it's the only register that can hold a variable shift count), and means that in almost all cases another register is configured to contain the desired count and is then copied into `%ecx`, rather than `%ecx` being configured directly.
Additionally, some conditions were rearranged to provide a performance micro-boost: `assigned(p)` is only checked once, because it always returns True on subsequent iterations of the `while` loop, and whether a `tai` object is of type `tai_regalloc` is checked first, since this is far more likely to be the case in a regular build.
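To make the intent concrete, here is a minimal, hypothetical sketch (the `taitype` enumeration and the two walk routines are simplified stand-ins, not the actual rgobj code): with short-circuit boolean evaluation, hoisting the `assigned(p)` test and putting the most likely `ait_regalloc` check first reduces the work done per iteration.

```pascal
program CondOrderSketch;
{$mode objfpc}

type
  { simplified stand-ins for the compiler's tai list entries }
  taitype = (ait_instruction, ait_regalloc, ait_comment);

  tai = class
    typ: taitype;
    next: tai;
  end;

procedure WalkHintsBefore(p: tai);
begin
  { assigned(p) is re-evaluated on every iteration, and the rarer case
    (ait_comment here) is tested before the common ait_regalloc case }
  while assigned(p) and ((p.typ = ait_comment) or (p.typ = ait_regalloc)) do
    p := p.next;
end;

procedure WalkHintsAfter(p: tai);
begin
  { assigned(p) is tested once up front; as in the real allocator, the walk
    is assumed to always reach a non-regalloc entry before the list runs
    out, so the test would stay True on every later iteration anyway.  The
    far more likely ait_regalloc case is also tested first. }
  if assigned(p) then
    while (p.typ = ait_regalloc) or (p.typ = ait_comment) do
      p := p.next;
end;

begin
end.
```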
## System
- Processor architecture: Cross-platform, but predominantly affects x86_64
## What is the current bug behavior?
The register allocator is not as efficient as it could be when real registers are reserved and resized (e.g. around shift and rotate instructions).
## What is the behavior after applying this patch?
This allocation inefficiency has been resolved.
## Relevant logs and/or screenshots
Under `x86_64-win64`, `-O4`, in `chmbase` - before:

```
.section .text.n_chmbase_$$_writecompressedinteger$pointer$longword$$longword,"ax"
.balign 16,0x90
.globl CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD:
.Lc11:
.seh_proc CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
pushq %rbx
.seh_pushreg %rbx
.Lc12:
pushq %rsi
.seh_pushreg %rsi
.Lc13:
leaq -40(%rsp),%rsp
...
.Lj23:
movl $127,%r11d
movl %r9d,%ecx
shll %cl,%r11d
movslq %r11d,%rbx
...
.Lj29:
movl %edx,%r11d
movl %r9d,%ecx
shrl %cl,%r11d
andl $127,%r11d
movb %r11b,(%r8)
...
```

After, the procedure can be 'coloured' with one fewer register, meaning `%rsi` doesn't have to be pushed and popped:

```
.section .text.n_chmbase_$$_writecompressedinteger$pointer$longword$$longword,"ax"
.balign 16,0x90
.globl CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD:
.Lc11:
.seh_proc CHMBASE_$$_WRITECOMPRESSEDINTEGER$POINTER$LONGWORD$$LONGWORD
pushq %rbx
.seh_pushreg %rbx
.Lc12:
leaq -48(%rsp),%rsp
...
.Lj23:
movl %r9d,%ecx
movl $127,%r10d
shll %cl,%r10d
movslq %r10d,%r11
...
.Lj29:
movl %r9d,%ecx
movl %edx,%r10d
shrl %cl,%r10d
andl $127,%r10d
movb %r10b,(%r8)
...
```

In `dbffile` - before:

```
.section .text.n_dbf_dbffile$_$tdbffile_$__$$_isnullflagset$pointer$tdbffielddef$tnullfieldflag$$boolean,"ax"
.balign 16,0x90
.globl DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN:
.Lc192:
.seh_proc DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
pushq %rbx
.seh_pushreg %rbx
.Lc193:
.seh_endprologue
...
je .Lj505
jmp .Lj503
.balign 16,0x90
.Lj504:
...
.Lj506:
xorb %al,%al
jmp .Lj503
.p2align 4,,10
.p2align 3
.Lj508:
...
movq (%r8),%rcx
movl 116(%rcx),%r11d
andl $7,%r11d
movl $1,%ebx
movl %r11d,%ecx
shll %cl,%ebx
movzbl (%r9),%ecx
andl %ecx,%ebx
setneb %al
jmp .Lj503
.balign 16,0x90
.Lj505:
...
```

After - this one is more profound, because the removal of `pushq %rbx` allows jumps to the end to be replaced with `ret`, and `%ecx` is now configured directly rather than `%r11d` being used and then copied into `%ecx`/`%cl`:

```
.section .text.n_dbf_dbffile$_$tdbffile_$__$$_isnullflagset$pointer$tdbffielddef$tnullfieldflag$$boolean,"ax"
.balign 16,0x90
.globl DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN
DBF_DBFFILE$_$TDBFFILE_$__$$_ISNULLFLAGSET$POINTER$TDBFFIELDDEF$TNULLFIELDFLAG$$BOOLEAN:
.Lc192:
...
je .Lj505
ret
.balign 16,0x90
.Lj504:
...
.Lj506:
xorb %al,%al
ret
.p2align 4,,10
.p2align 3
.Lj508:
...
movq (%r8),%rcx
movl 116(%rcx),%ecx
andl $7,%ecx
movl $1,%r11d
shll %cl,%r11d
movzbl (%r9),%ecx
andl %ecx,%r11d
setneb %al
ret
.balign 16,0x90
.Lj505:
```

## Additional Notes
The inefficiency was caused by the presence of a "resize" hint (a type of `tai_regalloc` object) that stopped deallocations from being placed before it, even though deallocations are usually placed before the allocations of other registers. In the case of shift/rotate code generation, `%ecx` was allocated and then resized to `%rcx`, and during a backwards search the register allocator would hit the "resize" hint first and place the deallocation of another register after the allocation of `%rcx`, which ultimately meant that the other register could never be 'coloured' `%rcx` even if the two live ranges didn't otherwise overlap.
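For illustration, here is a hedged sketch of that backwards search with simplified stand-in types and a made-up helper name (`FindDeallocInsertionPoint`); it is not the real allocator code, but it shows the essence of the fix - stepping over "resize" hints instead of stopping at them:

```pascal
program ResizeHintSketch;
{$mode objfpc}

type
  TRegAllocType = (ra_alloc, ra_dealloc, ra_resize);

  { simplified stand-ins for the tai / tai_regalloc hierarchy }
  tai = class
    previous: tai;
  end;

  tai_regalloc = class(tai)
    ratype: TRegAllocType;
  end;

{ Hypothetical helper: walk backwards over register hints to find the point
  at which a deallocation can be inserted, so that it ends up before the
  allocations of other registers.  Before the fix, ra_resize was effectively
  not part of this set, so the walk stopped at the resize hint and the
  deallocation was placed after the allocation of %rcx, making the live
  ranges appear to overlap. }
function FindDeallocInsertionPoint(p: tai): tai;
begin
  while assigned(p) and (p is tai_regalloc) and
        (tai_regalloc(p).ratype in [ra_alloc, ra_resize]) do
    p := p.previous;
  Result := p;
end;

begin
end.
```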
This fix doesn't affect i386 because no "resize" hint is generated on this platform, although the rearrangement of conditions will still provide a performance micro-boost.
The improved code isn't perfect. Because the full `%rcx` register is reserved under x86_64, the compiler sees fit to include additional `andl %ecx,%ecx` instructions (these were originally `movl %ireg##,%ecx` instructions, where `%ireg##` gets 'coloured' into `%ecx`) as part of its protections against incorrect merging and the loss of zero extensions. For example, in `expr` - before:

```
...
.Lj77:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rax
movl %eax,%ecx
shll %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
jmp .Lj69
.balign 16,0x90
.Lj78:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rax
movl %eax,%ecx
shrl %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
...
```

After:

```
...
.Lj77:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rcx
movslq %ecx,%rcx
andl %ecx,%ecx
shll %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
jmp .Lj69
.balign 16,0x90
.Lj78:
cvtsd2si %xmm6,%rsi
movq %rsp,%rcx
call EXPR$_$EVAL$SHORTSTRING$DOUBLE$SMALLINT_ADD_SUBT$$DOUBLE_MULT_DIV$$DOUBLE_$$_POWER$$DOUBLE
cvtsd2si %xmm0,%rcx
movslq %ecx,%rcx
andl %ecx,%ecx
shrl %cl,%esi
cvtsi2ssl %esi,%xmm0
cvtss2sd %xmm0,%xmm6
...
```

In this case, all of the zero and sign extensions are wholly unnecessary and the `movslq %ecx,%rcx; andl %ecx,%ecx` pairs can be removed completely. This could be done as a peephole optimisation, but is handled more fundamentally in !1130, which reserves only `%cl` rather than `%rcx`, since byte-sized registers are available for all general-purpose registers (existing peephole optimisations that handle `mov; and` pairs rather than `movx; and` pairs should be able to handle the rest if need be).
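Purely as an illustration of the peephole route that is mentioned but not taken here, the sketch below uses a made-up, text-based instruction list (nothing like the compiler's real taicpu representation) and simply assumes that a following shift by `%cl` is the only consumer of the value - a condition a real optimiser would have to prove through liveness tracking:

```pascal
program RedundantExtSketch;
{$mode objfpc}

type
  { grossly simplified stand-in for an assembler instruction entry }
  TInsn = record
    text: string;        { e.g. 'movslq %ecx,%rcx' }
  end;
  TInsnList = array of TInsn;

{ Hypothetical peephole: drop a 'movslq %ecx,%rcx' / 'andl %ecx,%ecx' pair
  when the next instruction is a shift by %cl (assumed to be the only use
  of the value); everything else is copied through unchanged. }
function StripRedundantPairs(const code: TInsnList): TInsnList;
var
  i, n: Integer;
begin
  SetLength(Result, Length(code));
  n := 0;
  i := 0;
  while i < Length(code) do
  begin
    if (i + 2 < Length(code)) and
       (code[i].text     = 'movslq %ecx,%rcx') and
       (code[i + 1].text = 'andl %ecx,%ecx') and
       (Pos('%cl,', code[i + 2].text) > 0) then
      Inc(i, 2)                       { skip the redundant pair }
    else
    begin
      Result[n] := code[i];
      Inc(n);
      Inc(i);
    end;
  end;
  SetLength(Result, n);               { trim to the surviving instructions }
end;

begin
end.
```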