improve the message-passing benchmark
This MR, on top of !101 (merged), aims to improve the message-passing benchmark I implemented in #87. The general setup remains the same: a producer domain allocates batches of cells and a consumer domain deallocates them.
The main changes:
- In #87 we observed that the results were less reliable if we allowed the producer to run ahead of the consumer by a lot, so now they are forced to synchronize every MAX_DELAY iterations (disabled if MAX_DELAY is set to a non-positive value).
- I initially wrote the benchmark to put fresh references in the exchanged cells, to get a mix of boxroot work and GC work. @gadmm was more interested in results without additional GC pressure, so the exchanged cells now contain just `()` (but the message-passing machinery still allocates).
- I rewrote the message-passing structure to use a Treiber stack instead of a contended mutex.
Preliminary results (`SCHEDULER=Parallel MAX_DELAY=128 NITERS=100_000 BATCH_SIZE=1024 CELL={impl}`):
| Command | Mean [s] | Relative |
|---|---|---|
ocaml |
0.776 ± 0.012 | 1.00 |
boxroot |
1.686 ± 0.013 | 2.17 ± 0.04 |
ocaml-ref |
2.681 ± 0.006 | 3.45 ± 0.05 |
generational |
9.259 ± 0.038 | 11.93 ± 0.19 |
bx-boxed |
10.731 ± 0.064 | 13.82 ± 0.22 |
boxroot is the fastest of the implementations that do something on cell creation (ocaml is a control implementation that does not box the cell value at all). bx-boxed, which is the new "one extra malloc" variant of boxroot I introduced in !101 (merged), is as slow as generational, which suggests that the overhead of generational on this benchmark may come from my system allocator rather than the OCaml runtime.
I'm not sure if there are particular things we want to evaluate through this benchmark. My understanding is that we are not claiming that boxroot is optimized for contended concurrent scenarios, only that its remote deallocation machinery is reasonably efficient and does not blow up under stress. The benchmark appears to go in this direction, but I am not sure how we could validate that it is testing this correctly. I suppose that we should try to observe the impact of implementation choices that we believe are important: if there is no clear impact, then our intuitions were wrong or the benchmark is not good enough.
So I played with:

- Setting `BXR_FORCE_REMOTE` to `true`. This speeds up the benchmark noticeably (from 1.1s to 0.9s) by eliding the remoteness check (all deallocations are remote in this benchmark).
- Disabling the `bxr_domain_lock_held()` semi-fast-path in `bxr_delete_aux`. This slows the benchmark down noticeably (from 1.1s to 1.5s). When the domain-lock-held path is disabled, remote deallocations must take the pool mutex. I think the overhead I observe is mild (smaller than I expected) because we have a single consumer domain. snmalloc has a multi-producer multi-consumer message-passing test (https://github.com/microsoft/snmalloc/blob/main/src/test/perf/msgpass/msgpass.cc), and I would expect a noticeably larger overhead in that scenario.
- In `get_available_pool`, getting rid of the `flush_delayed_fl` call, which should lead to a loss of reuse-in-place behavior in message-passing scenarios. I observe a 25% slowdown on the benchmark when doing this. (The patch that does this is quoted below.)
```diff
diff --git i/boxroot/boxroot.c w/boxroot/boxroot.c
index b0b7348..76d4080 100644
--- i/boxroot/boxroot.c
+++ w/boxroot/boxroot.c
@@ -627,12 +627,12 @@ static pool * get_available_pool(int dom_id)
 {
   bxr_domain_state *local = get_bxr_domain_state(dom_id);
   pool *p = pop_available_young(local);
-  if (p == NULL) p = pop_available_old(local);
-  if (p == NULL && local->free != NULL) p = ring_pop(&local->free);
-  if (p == NULL) p = get_empty_pool();
+  if (p == NULL || is_full_pool(p)) p = pop_available_old(local);
+  if ((p == NULL || is_full_pool(p)) && local->free != NULL) p = ring_pop(&local->free);
+  if (p == NULL || is_full_pool(p)) p = get_empty_pool();
   /* the young, old, or free pool might be full with pending
      deallocations. */
-  if (is_full_pool(p)) flush_delayed_fl(p);
+  /* if (is_full_pool(p)) flush_delayed_fl(p); */
   DEBUGassert(!is_full_pool(p));
   return p;
 }
```