
Draft: vector-based CCheckQueue

matricz requested to merge matricz/bitcoin-cash-node:checkqueue_vector into master

Note

I do not advise merging this MR at this time. It is an idea I entertained and found valid, so I decided to put it in working order and contribute it rather than let it be lost on a forgotten branch, and to gather feedback from people willing to test it, for example on Scalenet.

Summary

The CCheckQueue is used in validation to run concurrent script/sig checks.

The current implementation of CCheckQueue is fairly classical: there is a shared vector, protected by a shared lock, to which the producer writes and from which the consumers read. It works fine, and is further improved by !1707 (merged) (which this MR depends on). On CPUs with a massive number of cores, though, this implementation breaks down and throughput actually drops; my tentative educated guess is that this is caused by lock contention and depends on core clock and core count.

Generally, task queues are used for ongoing processes, where the number of tasks is potentially unbounded. Our use case has a strong precondition that we do not take advantage of: for each run, the number of checks is finite and known (or trivial to calculate).

I entertained the idea of taking advantage of this strong precondition. The "queue" is exchanged for a vector sized to hold all of the checks without reallocation, so that it can be concurrently written to and read from. As long as (1) the vector never needs to reallocate, and (2) a concurrent read and write to the same index never happens, this behavior is well defined (link, section "Thread safety").

The basic idea is that atomics let us set up such guarantees without locks, so that a large number of threads can work on different checks by atomically incrementing a shared cursor via fetch_add(1). I tried to put most of the documentation in comments.
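
As a rough illustration of the scheme (a simplified sketch with illustrative names, not the MR's actual code), a pre-sized vector plus two atomic counters can replace the locked queue: the producer publishes each check with a release store, and workers claim indices with `fetch_add(1)`:

```cpp
// Simplified sketch: a fixed-capacity check buffer for a run whose size is
// known up front. Names are illustrative, not taken from the MR.
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

template <typename Check> // Check: default-constructible, callable, returns bool
class FixedCheckBuffer {
    std::vector<Check> checks;           // pre-sized once; never reallocates
    const std::size_t nTotal;            // number of checks in this run
    std::atomic<std::size_t> nAdded{0};  // count of fully written (published) checks
    std::atomic<std::size_t> nCursor{0}; // next index for a worker to claim

public:
    explicit FixedCheckBuffer(std::size_t nExpected)
        : checks(nExpected), nTotal(nExpected) {}

    // Single producer: write the slot first, then publish it with a release
    // store, so a worker that observes the new nAdded also sees the write.
    void Add(Check &&check) {
        const std::size_t idx = nAdded.load(std::memory_order_relaxed);
        checks[idx] = std::move(check);
        nAdded.store(idx + 1, std::memory_order_release);
    }

    // Worker (or master): claim the next index via fetch_add(1); each index is
    // claimed by exactly one thread, so the vector needs no lock around it.
    // Returns false once every index has been claimed.
    bool RunOne(bool &fOk) {
        const std::size_t idx = nCursor.fetch_add(1, std::memory_order_relaxed);
        if (idx >= nTotal) return false;
        // Wait until the producer has published this slot (simplistic spin).
        while (nAdded.load(std::memory_order_acquire) <= idx) {
            std::this_thread::yield();
        }
        fOk = checks[idx]();
        return true;
    }
};
```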

API changes are kept to a minimum and all the properties of the current CCheckQueue are preserved:

  1. Worker threads start processing checks as soon as they become available (they do not need to wait for all checks to be added before starting)
  2. The queue can bail early as soon as a single failing check is found (though not before all checks have been added, a current limitation that could possibly be lifted; see the worker-loop sketch after this list)
  3. The control/master thread can join the workers in processing checks once all the checks have been added
  4. Worker threads all finish as soon as possible, i.e. no thread is left doing significant work while the others are done (currently this is achieved via a clever calculation of batch size and number of idle threads)
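
On top of a buffer like the sketch above, the worker loop could look roughly as follows (again an illustration, not the MR's code): a shared atomic flag gives the early-bail behavior of point 2, and the master thread can simply run the same loop once it has added all checks (point 3).

```cpp
// Illustrative worker loop over the FixedCheckBuffer sketch above. Worker
// threads run it while the producer is still adding checks; the master
// thread can run it too once it has finished adding.
template <typename Check>
bool WorkerLoop(FixedCheckBuffer<Check> &buffer, std::atomic<bool> &fAllOk) {
    bool fOk = true;
    while (buffer.RunOne(fOk)) {
        if (!fOk) {
            fAllOk.store(false, std::memory_order_relaxed);
        }
        // Bail early if any thread has reported a failing check.
        if (!fAllOk.load(std::memory_order_relaxed)) break;
    }
    return fAllOk.load(std::memory_order_relaxed);
}
```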

API changes introduced:

  1. The batch size is no longer needed and is removed from the queue's constructor
  2. The size of the check vector needs to be known for each run at the creation of the CCheckQueueControl object (so the growing vector never reallocates); see the call-site sketch after this list
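
As a rough illustration of the call-site impact (the exact constructor signatures below are my assumption based on the description above, not necessarily what the MR implements):

```cpp
// --- Before: the queue is constructed with a batch size; the control object
// only takes a pointer to the queue.
CCheckQueue<CScriptCheck> scriptcheckqueue(128 /* batch size */);
CCheckQueueControl<CScriptCheck> control(&scriptcheckqueue);

// --- After (hypothetical sketch): no batch size, but the number of checks
// for the run is passed when the control object is created, so the
// underlying vector can be sized once and never reallocate.
CCheckQueue<CScriptCheck> scriptcheckqueue;
CCheckQueueControl<CScriptCheck> control(&scriptcheckqueue,
                                         nExpectedChecks /* known per run */);
```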

We have a somewhat extensive test suite for CCheckQueue, all of which passes. On benchmarks this branch appears to be about 2% slower than !1707 (merged), but it does much better with massive core counts (it does not break down past a threshold).

Test Plan

  • ninja all check-all
  • ../contrib/bench/bench_compare_branch.sh --filter CCheckQueue_RealBlock_32MB_NoCacheStore