Draft: optimised remembered set implementation
This is based on top of !42 (merged). I do not know if we will merge it (at least in this form). This MR is for discussion (see below).
This experiment was to combine the optimisations from the inlined version with the idea of using the remembered set.
Here are the benchmark results (boxroot
: this MR; cur_boxroot
: a copy of the previous inlining version; rem_boxroot
: @gaches's remembered-set version):
Benchmark: perm_count
---
gc: 2.10s
count: 3628800
---
boxroot: 1.90s
count: 3628800
---
cur_boxroot: 1.86s
count: 3628800
---
rem_boxroot: 1.96s
count: 3628800
---
Benchmark: synthetic
---
gc: 7.42s
---
boxroot: 5.91s
---
cur_boxroot: 5.85s
---
rem_boxroot: 6.03s
---
Benchmark: globroots
---
gc: 1.33s
---
boxroot: 1.29s
---
cur_boxroot: 1.42s
---
rem_boxroot: 1.30s
---
ocaml: 1.44s
---
dll_boxroot: 1.33s
---
generational: 1.27s
---
global: 1.32s
---
Benchmark: local_roots
---
local_roots(ROOT=local , N=1): 13.01ns
local_roots(ROOT=boxroot , N=1): 16.64ns
local_roots(ROOT=cur_boxroot , N=1): 14.09ns
local_roots(ROOT=rem_boxroot , N=1): 21.85ns
---
local_roots(ROOT=local , N=2): 24.56ns
local_roots(ROOT=boxroot , N=2): 24.92ns
local_roots(ROOT=cur_boxroot , N=2): 23.74ns
local_roots(ROOT=rem_boxroot , N=2): 33.12ns
---
local_roots(ROOT=local , N=5): 57.60ns
local_roots(ROOT=boxroot , N=5): 50.66ns
local_roots(ROOT=cur_boxroot , N=5): 50.42ns
local_roots(ROOT=rem_boxroot , N=5): 67.24ns
---
local_roots(ROOT=local , N=10): 115.20ns
local_roots(ROOT=boxroot , N=10): 95.28ns
local_roots(ROOT=cur_boxroot , N=10): 86.95ns
local_roots(ROOT=rem_boxroot , N=10): 126.44ns
---
local_roots(ROOT=local , N=100): 1261.29ns
local_roots(ROOT=boxroot , N=100): 874.96ns
local_roots(ROOT=cur_boxroot , N=100): 810.70ns
local_roots(ROOT=rem_boxroot , N=100): 1167.54ns
---
local_roots(ROOT=local , N=1000): 13106.04ns
local_roots(ROOT=boxroot , N=1000): 8467.53ns
local_roots(ROOT=cur_boxroot , N=1000): 8199.43ns
local_roots(ROOT=rem_boxroot , N=1000): 11176.96ns
---
Since we have already established that the difference in globroots
for cur_boxroot
is attributed to a negligible overhead per minor collection, at best the inlining+remembered set version from this MR does not noticeably improve over cur_boxroot. (The speedup of cur_boxroots over the other versions would need to be confirmed by controlling for alignment effects, etc.)
On the other hand, using the remembered set has hidden costs:
- The code to inline is more complicated:
- Much larger code size due to having a page table check in create and delete (see below). I am not sure that our micro-benchmarks let us see all the costs of increased code size and branch mispredictions, by nature of them being micro-benchmarks.
- More code to duplicate if we want to inline in Rust.
- Bigger dependency on internals of the OCaml GC (version-dependent implementation, respect the invariant of the remembered setwith version-local changes).
- More complicated to find in which pool to allocate next if there are two freelists.
The cur_boxroot
simply allocates unconditionally in the young pools and merges the young pools with the old pools at minor collections, and has a simple mechanism to put empty-enough pools at the beginning of its ring. It shows that it pays to remove costs for create and delete even with a more naive generational pool management.
cc @gasche : I still have to incorporate improvements from your own remembered set version (such as using a union to define slot
to avoid having casts all over the place). On the other hand I think there's a bug in your version since I think that your version of is_free_slot
can misclassify an allocated slot containing an immediate as a free slot.
Also, a notable difference is that in order to keep my simple pool-finding algorithm, I use one global young freelist, instead of separate young freelists per pool. (The downside is that you don't get the good spatial cache locality of allocating in pools, but this is counterbalanced by the fact that the young_freelist is supposed to remain small.)
Asm for this MR (optimised as much as I could):
Dump of assembler code for function boxroot_create:
0x0000000000058ed0 <+0>: endbr64
0x0000000000058ed4 <+4>: lea rax,[rip+0x64405] # 0xbd2e0 <Caml_state>
0x0000000000058edb <+11>: mov rcx,rdi
0x0000000000058ede <+14>: mov rsi,rdi
0x0000000000058ee1 <+17>: mov rax,QWORD PTR [rax]
0x0000000000058ee4 <+20>: mov rdx,QWORD PTR [rax+0x20]
0x0000000000058ee8 <+24>: mov rax,QWORD PTR [rax+0x28]
0x0000000000058eec <+28>: sub rcx,rdx
0x0000000000058eef <+31>: sub rax,rdx
0x0000000000058ef2 <+34>: cmp rcx,rax
0x0000000000058ef5 <+37>: ja 0x58f40 <boxroot_create+112>
0x0000000000058ef7 <+39>: lea rax,[rip+0x5d1f2] # 0xb60f0 <boxroot_young_fl>
0x0000000000058efe <+46>: mov r8,QWORD PTR [rax]
0x0000000000058f01 <+49>: cmp r8,rax
0x0000000000058f04 <+52>: je 0x58f20 <boxroot_create+80>
0x0000000000058f06 <+54>: mov rdx,QWORD PTR [r8]
0x0000000000058f09 <+57>: add QWORD PTR [rax+0x8],0x1
0x0000000000058f0e <+62>: and rdx,0xfffffffffffffffe
0x0000000000058f12 <+66>: mov QWORD PTR [rax],rdx
0x0000000000058f15 <+69>: mov rax,r8
0x0000000000058f18 <+72>: mov QWORD PTR [r8],rsi
0x0000000000058f1b <+75>: ret
0x0000000000058f1c <+76>: nop DWORD PTR [rax+0x0]
0x0000000000058f20 <+80>: lea rdx,[rip+0x5d1d9] # 0xb6100 <boxroot_pool_fl>
0x0000000000058f27 <+87>: mov rax,QWORD PTR [rdx]
0x0000000000058f2a <+90>: mov rdi,QWORD PTR [rax]
0x0000000000058f2d <+93>: cmp rdi,rax
0x0000000000058f30 <+96>: jne 0x58f70 <boxroot_create+160>
0x0000000000058f32 <+98>: mov rdi,rsi
0x0000000000058f35 <+101>: jmp 0x58b00 <boxroot_alloc_slot_slow>
0x0000000000058f3a <+106>: nop WORD PTR [rax+rax*1+0x0]
0x0000000000058f40 <+112>: lea rdx,[rip+0x5d1b9] # 0xb6100 <boxroot_pool_fl>
0x0000000000058f47 <+119>: mov rax,QWORD PTR [rdx]
0x0000000000058f4a <+122>: mov r8,QWORD PTR [rax]
0x0000000000058f4d <+125>: cmp r8,rax
0x0000000000058f50 <+128>: je 0x58f32 <boxroot_create+98>
0x0000000000058f52 <+130>: mov rcx,QWORD PTR [r8]
0x0000000000058f55 <+133>: mov QWORD PTR [rax],rcx
0x0000000000058f58 <+136>: mov rax,QWORD PTR [rdx]
0x0000000000058f5b <+139>: add QWORD PTR [rax+0x8],0x1
0x0000000000058f60 <+144>: mov rax,r8
0x0000000000058f63 <+147>: mov QWORD PTR [r8],rsi
0x0000000000058f66 <+150>: ret
0x0000000000058f67 <+151>: nop WORD PTR [rax+rax*1+0x0]
0x0000000000058f70 <+160>: mov rcx,QWORD PTR [rdi]
0x0000000000058f73 <+163>: mov QWORD PTR [rax],rcx
0x0000000000058f76 <+166>: mov rax,QWORD PTR [rdx]
0x0000000000058f79 <+169>: add QWORD PTR [rax+0x8],0x1
0x0000000000058f7e <+174>: jmp 0x586a0 <boxroot_write_barrier>
Dump of assembler code for function boxroot_delete:
0x0000000000058a90 <+0>: endbr64
0x0000000000058a94 <+4>: lea rax,[rip+0x64845] # 0xbd2e0 <Caml_state>
0x0000000000058a9b <+11>: mov rdx,QWORD PTR [rdi]
0x0000000000058a9e <+14>: mov rax,QWORD PTR [rax]
0x0000000000058aa1 <+17>: mov rcx,QWORD PTR [rax+0x20]
0x0000000000058aa5 <+21>: mov rax,QWORD PTR [rax+0x28]
0x0000000000058aa9 <+25>: sub rdx,rcx
0x0000000000058aac <+28>: sub rax,rcx
0x0000000000058aaf <+31>: cmp rdx,rax
0x0000000000058ab2 <+34>: ja 0x58ae0 <boxroot_delete+80>
0x0000000000058ab4 <+36>: lea rax,[rip+0x5d635] # 0xb60f0 <boxroot_young_fl>
0x0000000000058abb <+43>: mov rdx,QWORD PTR [rax]
0x0000000000058abe <+46>: or rdx,0x1
0x0000000000058ac2 <+50>: mov QWORD PTR [rdi],rdx
0x0000000000058ac5 <+53>: mov rdx,QWORD PTR [rax+0x8]
0x0000000000058ac9 <+57>: mov QWORD PTR [rax],rdi
0x0000000000058acc <+60>: lea rcx,[rdx-0x1]
0x0000000000058ad0 <+64>: and edx,0xfff
0x0000000000058ad6 <+70>: mov QWORD PTR [rax+0x8],rcx
0x0000000000058ada <+74>: je 0x58af0 <boxroot_delete+96>
0x0000000000058adc <+76>: ret
0x0000000000058add <+77>: nop DWORD PTR [rax]
0x0000000000058ae0 <+80>: mov rax,rdi
0x0000000000058ae3 <+83>: and rax,0xffffffffffffc000
0x0000000000058ae9 <+89>: mov rdx,QWORD PTR [rax]
0x0000000000058aec <+92>: jmp 0x58ac2 <boxroot_delete+50>
0x0000000000058aee <+94>: xchg ax,ax
0x0000000000058af0 <+96>: mov rdi,rax
0x0000000000058af3 <+99>: jmp 0x58790 <boxroot_try_demote_pool>
Compare with !42 (merged):
Dump of assembler code for function cur_boxroot_create:
0x0000000000059d00 <+0>: endbr64
0x0000000000059d04 <+4>: lea rax,[rip+0x5c415] # 0xb6120 <cur_boxroot_current_fl>
0x0000000000059d0b <+11>: mov rax,QWORD PTR [rax]
0x0000000000059d0e <+14>: mov r8,QWORD PTR [rax]
0x0000000000059d11 <+17>: cmp rax,r8
0x0000000000059d14 <+20>: je 0x59d30 <cur_boxroot_create+48>
0x0000000000059d16 <+22>: mov rdx,QWORD PTR [r8]
0x0000000000059d19 <+25>: add DWORD PTR [rax+0x8],0x1
0x0000000000059d1d <+29>: mov QWORD PTR [rax],rdx
0x0000000000059d20 <+32>: mov rax,r8
0x0000000000059d23 <+35>: mov QWORD PTR [r8],rdi
0x0000000000059d26 <+38>: ret
0x0000000000059d27 <+39>: nop WORD PTR [rax+rax*1+0x0]
0x0000000000059d30 <+48>: jmp 0x59a70 <cur_boxroot_alloc_slot_slow>
Dump of assembler code for function cur_boxroot_delete:
0x0000000000059a30 <+0>: endbr64
0x0000000000059a34 <+4>: mov r8,rdi
0x0000000000059a37 <+7>: and r8,0xffffffffffffc000
0x0000000000059a3e <+14>: mov rax,QWORD PTR [r8]
0x0000000000059a41 <+17>: mov QWORD PTR [rdi],rax
0x0000000000059a44 <+20>: mov eax,DWORD PTR [r8+0x8]
0x0000000000059a48 <+24>: mov QWORD PTR [r8],rdi
0x0000000000059a4b <+27>: sub eax,0x1
0x0000000000059a4e <+30>: mov DWORD PTR [r8+0x8],eax
0x0000000000059a52 <+34>: test eax,0x1fff
0x0000000000059a57 <+39>: je 0x59a60 <cur_boxroot_delete+48>
0x0000000000059a59 <+41>: ret
0x0000000000059a5a <+42>: nop WORD PTR [rax+rax*1+0x0]
0x0000000000059a60 <+48>: mov rdi,r8
0x0000000000059a63 <+51>: jmp 0x59930 <cur_boxroot_try_demote_pool>