Skip to content

Draft: optimised remembered set implementation

This is based on top of !42 (merged). I do not know if we will merge it (at least in this form). This MR is for discussion (see below).

This experiment was to combine the optimisations from the inlined version with the idea of using the remembered set.

Here are the benchmark results (boxroot: this MR; cur_boxroot: a copy of the previous inlining version; rem_boxroot: @gaches's remembered-set version):

Benchmark: perm_count
---
gc: 2.10s            
count: 3628800
---
boxroot: 1.90s       
count: 3628800
---
cur_boxroot: 1.86s   
count: 3628800
---
rem_boxroot: 1.96s   
count: 3628800
---


Benchmark: synthetic
---
gc: 7.42s            
---
boxroot: 5.91s       
---
cur_boxroot: 5.85s   
---
rem_boxroot: 6.03s   
---


Benchmark: globroots
---
gc: 1.33s            
---
boxroot: 1.29s       
---
cur_boxroot: 1.42s   
---
rem_boxroot: 1.30s   
---
ocaml: 1.44s         
---
dll_boxroot: 1.33s   
---
generational: 1.27s  
---
global: 1.32s        
---


Benchmark: local_roots
---
local_roots(ROOT=local       , N=1):    13.01ns
local_roots(ROOT=boxroot     , N=1):    16.64ns
local_roots(ROOT=cur_boxroot , N=1):    14.09ns
local_roots(ROOT=rem_boxroot , N=1):    21.85ns
---
local_roots(ROOT=local       , N=2):    24.56ns
local_roots(ROOT=boxroot     , N=2):    24.92ns
local_roots(ROOT=cur_boxroot , N=2):    23.74ns
local_roots(ROOT=rem_boxroot , N=2):    33.12ns
---
local_roots(ROOT=local       , N=5):    57.60ns
local_roots(ROOT=boxroot     , N=5):    50.66ns
local_roots(ROOT=cur_boxroot , N=5):    50.42ns
local_roots(ROOT=rem_boxroot , N=5):    67.24ns
---
local_roots(ROOT=local       , N=10):   115.20ns
local_roots(ROOT=boxroot     , N=10):    95.28ns
local_roots(ROOT=cur_boxroot , N=10):    86.95ns
local_roots(ROOT=rem_boxroot , N=10):   126.44ns
---
local_roots(ROOT=local       , N=100):  1261.29ns
local_roots(ROOT=boxroot     , N=100):   874.96ns
local_roots(ROOT=cur_boxroot , N=100):   810.70ns
local_roots(ROOT=rem_boxroot , N=100):  1167.54ns
---
local_roots(ROOT=local       , N=1000): 13106.04ns
local_roots(ROOT=boxroot     , N=1000):  8467.53ns
local_roots(ROOT=cur_boxroot , N=1000):  8199.43ns
local_roots(ROOT=rem_boxroot , N=1000): 11176.96ns
---

Since we have already established that the difference in globroots for cur_boxroot is attributed to a negligible overhead per minor collection, at best the inlining+remembered set version from this MR does not noticeably improve over cur_boxroot. (The speedup of cur_boxroots over the other versions would need to be confirmed by controlling for alignment effects, etc.)

On the other hand, using the remembered set has hidden costs:

  • The code to inline is more complicated:
    • Much larger code size due to having a page table check in create and delete (see below). I am not sure that our micro-benchmarks let us see all the costs of increased code size and branch mispredictions, by nature of them being micro-benchmarks.
    • More code to duplicate if we want to inline in Rust.
  • Bigger dependency on internals of the OCaml GC (version-dependent implementation, respect the invariant of the remembered setwith version-local changes).
  • More complicated to find in which pool to allocate next if there are two freelists.

The cur_boxroot simply allocates unconditionally in the young pools and merges the young pools with the old pools at minor collections, and has a simple mechanism to put empty-enough pools at the beginning of its ring. It shows that it pays to remove costs for create and delete even with a more naive generational pool management.

cc @gasche : I still have to incorporate improvements from your own remembered set version (such as using a union to define slot to avoid having casts all over the place). On the other hand I think there's a bug in your version since I think that your version of is_free_slot can misclassify an allocated slot containing an immediate as a free slot.

Also, a notable difference is that in order to keep my simple pool-finding algorithm, I use one global young freelist, instead of separate young freelists per pool. (The downside is that you don't get the good spatial cache locality of allocating in pools, but this is counterbalanced by the fact that the young_freelist is supposed to remain small.)


Asm for this MR (optimised as much as I could):

Dump of assembler code for function boxroot_create:
   0x0000000000058ed0 <+0>:	endbr64 
   0x0000000000058ed4 <+4>:	lea    rax,[rip+0x64405]        # 0xbd2e0 <Caml_state>
   0x0000000000058edb <+11>:	mov    rcx,rdi
   0x0000000000058ede <+14>:	mov    rsi,rdi
   0x0000000000058ee1 <+17>:	mov    rax,QWORD PTR [rax]
   0x0000000000058ee4 <+20>:	mov    rdx,QWORD PTR [rax+0x20]
   0x0000000000058ee8 <+24>:	mov    rax,QWORD PTR [rax+0x28]
   0x0000000000058eec <+28>:	sub    rcx,rdx
   0x0000000000058eef <+31>:	sub    rax,rdx
   0x0000000000058ef2 <+34>:	cmp    rcx,rax
   0x0000000000058ef5 <+37>:	ja     0x58f40 <boxroot_create+112>
   0x0000000000058ef7 <+39>:	lea    rax,[rip+0x5d1f2]        # 0xb60f0 <boxroot_young_fl>
   0x0000000000058efe <+46>:	mov    r8,QWORD PTR [rax]
   0x0000000000058f01 <+49>:	cmp    r8,rax
   0x0000000000058f04 <+52>:	je     0x58f20 <boxroot_create+80>
   0x0000000000058f06 <+54>:	mov    rdx,QWORD PTR [r8]
   0x0000000000058f09 <+57>:	add    QWORD PTR [rax+0x8],0x1
   0x0000000000058f0e <+62>:	and    rdx,0xfffffffffffffffe
   0x0000000000058f12 <+66>:	mov    QWORD PTR [rax],rdx
   0x0000000000058f15 <+69>:	mov    rax,r8
   0x0000000000058f18 <+72>:	mov    QWORD PTR [r8],rsi
   0x0000000000058f1b <+75>:	ret    
   0x0000000000058f1c <+76>:	nop    DWORD PTR [rax+0x0]
   0x0000000000058f20 <+80>:	lea    rdx,[rip+0x5d1d9]        # 0xb6100 <boxroot_pool_fl>
   0x0000000000058f27 <+87>:	mov    rax,QWORD PTR [rdx]
   0x0000000000058f2a <+90>:	mov    rdi,QWORD PTR [rax]
   0x0000000000058f2d <+93>:	cmp    rdi,rax
   0x0000000000058f30 <+96>:	jne    0x58f70 <boxroot_create+160>
   0x0000000000058f32 <+98>:	mov    rdi,rsi
   0x0000000000058f35 <+101>:	jmp    0x58b00 <boxroot_alloc_slot_slow>
   0x0000000000058f3a <+106>:	nop    WORD PTR [rax+rax*1+0x0]
   0x0000000000058f40 <+112>:	lea    rdx,[rip+0x5d1b9]        # 0xb6100 <boxroot_pool_fl>
   0x0000000000058f47 <+119>:	mov    rax,QWORD PTR [rdx]
   0x0000000000058f4a <+122>:	mov    r8,QWORD PTR [rax]
   0x0000000000058f4d <+125>:	cmp    r8,rax
   0x0000000000058f50 <+128>:	je     0x58f32 <boxroot_create+98>
   0x0000000000058f52 <+130>:	mov    rcx,QWORD PTR [r8]
   0x0000000000058f55 <+133>:	mov    QWORD PTR [rax],rcx
   0x0000000000058f58 <+136>:	mov    rax,QWORD PTR [rdx]
   0x0000000000058f5b <+139>:	add    QWORD PTR [rax+0x8],0x1
   0x0000000000058f60 <+144>:	mov    rax,r8
   0x0000000000058f63 <+147>:	mov    QWORD PTR [r8],rsi
   0x0000000000058f66 <+150>:	ret    
   0x0000000000058f67 <+151>:	nop    WORD PTR [rax+rax*1+0x0]
   0x0000000000058f70 <+160>:	mov    rcx,QWORD PTR [rdi]
   0x0000000000058f73 <+163>:	mov    QWORD PTR [rax],rcx
   0x0000000000058f76 <+166>:	mov    rax,QWORD PTR [rdx]
   0x0000000000058f79 <+169>:	add    QWORD PTR [rax+0x8],0x1
   0x0000000000058f7e <+174>:	jmp    0x586a0 <boxroot_write_barrier>

Dump of assembler code for function boxroot_delete:
   0x0000000000058a90 <+0>:	endbr64 
   0x0000000000058a94 <+4>:	lea    rax,[rip+0x64845]        # 0xbd2e0 <Caml_state>
   0x0000000000058a9b <+11>:	mov    rdx,QWORD PTR [rdi]
   0x0000000000058a9e <+14>:	mov    rax,QWORD PTR [rax]
   0x0000000000058aa1 <+17>:	mov    rcx,QWORD PTR [rax+0x20]
   0x0000000000058aa5 <+21>:	mov    rax,QWORD PTR [rax+0x28]
   0x0000000000058aa9 <+25>:	sub    rdx,rcx
   0x0000000000058aac <+28>:	sub    rax,rcx
   0x0000000000058aaf <+31>:	cmp    rdx,rax
   0x0000000000058ab2 <+34>:	ja     0x58ae0 <boxroot_delete+80>
   0x0000000000058ab4 <+36>:	lea    rax,[rip+0x5d635]        # 0xb60f0 <boxroot_young_fl>
   0x0000000000058abb <+43>:	mov    rdx,QWORD PTR [rax]
   0x0000000000058abe <+46>:	or     rdx,0x1
   0x0000000000058ac2 <+50>:	mov    QWORD PTR [rdi],rdx
   0x0000000000058ac5 <+53>:	mov    rdx,QWORD PTR [rax+0x8]
   0x0000000000058ac9 <+57>:	mov    QWORD PTR [rax],rdi
   0x0000000000058acc <+60>:	lea    rcx,[rdx-0x1]
   0x0000000000058ad0 <+64>:	and    edx,0xfff
   0x0000000000058ad6 <+70>:	mov    QWORD PTR [rax+0x8],rcx
   0x0000000000058ada <+74>:	je     0x58af0 <boxroot_delete+96>
   0x0000000000058adc <+76>:	ret    
   0x0000000000058add <+77>:	nop    DWORD PTR [rax]
   0x0000000000058ae0 <+80>:	mov    rax,rdi
   0x0000000000058ae3 <+83>:	and    rax,0xffffffffffffc000
   0x0000000000058ae9 <+89>:	mov    rdx,QWORD PTR [rax]
   0x0000000000058aec <+92>:	jmp    0x58ac2 <boxroot_delete+50>
   0x0000000000058aee <+94>:	xchg   ax,ax
   0x0000000000058af0 <+96>:	mov    rdi,rax
   0x0000000000058af3 <+99>:	jmp    0x58790 <boxroot_try_demote_pool>

Compare with !42 (merged):

Dump of assembler code for function cur_boxroot_create:
   0x0000000000059d00 <+0>:	endbr64 
   0x0000000000059d04 <+4>:	lea    rax,[rip+0x5c415]        # 0xb6120 <cur_boxroot_current_fl>
   0x0000000000059d0b <+11>:	mov    rax,QWORD PTR [rax]
   0x0000000000059d0e <+14>:	mov    r8,QWORD PTR [rax]
   0x0000000000059d11 <+17>:	cmp    rax,r8
   0x0000000000059d14 <+20>:	je     0x59d30 <cur_boxroot_create+48>
   0x0000000000059d16 <+22>:	mov    rdx,QWORD PTR [r8]
   0x0000000000059d19 <+25>:	add    DWORD PTR [rax+0x8],0x1
   0x0000000000059d1d <+29>:	mov    QWORD PTR [rax],rdx
   0x0000000000059d20 <+32>:	mov    rax,r8
   0x0000000000059d23 <+35>:	mov    QWORD PTR [r8],rdi
   0x0000000000059d26 <+38>:	ret    
   0x0000000000059d27 <+39>:	nop    WORD PTR [rax+rax*1+0x0]
   0x0000000000059d30 <+48>:	jmp    0x59a70 <cur_boxroot_alloc_slot_slow>

Dump of assembler code for function cur_boxroot_delete:
   0x0000000000059a30 <+0>:	endbr64 
   0x0000000000059a34 <+4>:	mov    r8,rdi
   0x0000000000059a37 <+7>:	and    r8,0xffffffffffffc000
   0x0000000000059a3e <+14>:	mov    rax,QWORD PTR [r8]
   0x0000000000059a41 <+17>:	mov    QWORD PTR [rdi],rax
   0x0000000000059a44 <+20>:	mov    eax,DWORD PTR [r8+0x8]
   0x0000000000059a48 <+24>:	mov    QWORD PTR [r8],rdi
   0x0000000000059a4b <+27>:	sub    eax,0x1
   0x0000000000059a4e <+30>:	mov    DWORD PTR [r8+0x8],eax
   0x0000000000059a52 <+34>:	test   eax,0x1fff
   0x0000000000059a57 <+39>:	je     0x59a60 <cur_boxroot_delete+48>
   0x0000000000059a59 <+41>:	ret    
   0x0000000000059a5a <+42>:	nop    WORD PTR [rax+rax*1+0x0]
   0x0000000000059a60 <+48>:	mov    rdi,r8
   0x0000000000059a63 <+51>:	jmp    0x59930 <cur_boxroot_try_demote_pool>

Merge request reports

Loading