Optimize performance for single-thread use
Historically, it hasn’t been perfectly clear how libblkio-async is to be used, i.e. whether an AsyncBlkioq can be shared/moved between threads, and whether requests can be moved to different threads (which is what e.g. tokio would automatically do when using its thread pools).
It has since turned out that these small requests are generally so fast that moving them to a different thread only makes performance worse. Instead, if we want multithreading, we should have multiple queues (one per thread). Therefore, in 82437d2a, we have dropped the Send implementation for requests (and basically decided that a single AsyncBlkioq should also not be shared between threads).
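As a rough illustration of the "one queue per thread" model, here is a minimal std-only sketch. `LocalQueue`, `process` and `run_workers` are hypothetical stand-ins, not libblkio-async's API; the point is only that each queue is created on, and never leaves, the thread that uses it.

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::thread;

/// Hypothetical stand-in for a single-threaded queue (not the real AsyncBlkioq).
/// The Rc field makes this type !Send, so it cannot be moved to another thread.
struct LocalQueue {
    completed: Rc<Cell<u64>>,
}

impl LocalQueue {
    fn new() -> Self {
        LocalQueue { completed: Rc::new(Cell::new(0)) }
    }

    /// Pretend to handle one request; only bumps a non-atomic counter.
    fn process(&self, _request: u64) {
        self.completed.set(self.completed.get() + 1);
    }
}

/// Spawn `threads` workers, each owning its own queue, and return the
/// number of requests each worker completed.
fn run_workers(threads: usize, requests_per_thread: u64) -> Vec<u64> {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                // The queue is constructed inside the thread and stays there.
                let queue = LocalQueue::new();
                for request in 0..requests_per_thread {
                    queue.process(request);
                }
                // Only the plain u64 result crosses the thread boundary.
                queue.completed.get()
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Constructing the `LocalQueue` outside the closure and moving it in would be rejected by the compiler, because `Rc` is not `Send`.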
With AsyncBlkioq used only from a single thread, we can drop a lot of multi-thread safeguards that cost performance (inspired by issue #9 (closed)):
- Arc<T> can be replaced by Rc<T>
- Mutex<T> can be replaced by RefCell<T> or Cell<T>
- Atomics can be replaced by Cell<T>
- The CFD waiters list need not have a lock-free thread-safe implementation for enqueuing/dequeuing
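The replacements listed above can be sketched roughly as follows. `QueueState`, `QueueHandle` and their fields are hypothetical names chosen for illustration and do not reflect the actual libblkio-async internals; the comments note which thread-safe type each single-threaded type would stand in for.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;
use std::rc::Rc;

/// Hypothetical per-queue state (illustrative only).
struct QueueState {
    /// Would be an AtomicUsize in a thread-safe variant.
    in_flight: Cell<usize>,
    /// Would be a Mutex<VecDeque<_>> or a lock-free list in a thread-safe variant.
    waiters: RefCell<VecDeque<u64>>,
}

/// Would wrap Arc<QueueState> in a thread-safe variant; Rc uses
/// non-atomic reference counting instead.
#[derive(Clone)]
struct QueueHandle(Rc<QueueState>);

impl QueueHandle {
    fn new() -> Self {
        QueueHandle(Rc::new(QueueState {
            in_flight: Cell::new(0),
            waiters: RefCell::new(VecDeque::new()),
        }))
    }

    /// Enqueue a waiter without taking any lock.
    fn submit(&self, request_id: u64) {
        self.0.in_flight.set(self.0.in_flight.get() + 1);
        self.0.waiters.borrow_mut().push_back(request_id);
    }

    /// Dequeue the oldest waiter, if any.
    fn complete_next(&self) -> Option<u64> {
        let id = self.0.waiters.borrow_mut().pop_front()?;
        self.0.in_flight.set(self.0.in_flight.get() - 1);
        Some(id)
    }

    fn in_flight(&self) -> usize {
        self.0.in_flight.get()
    }
}
```

Since `Rc`, `Cell` and `RefCell` are not thread-safe, `QueueHandle` is automatically neither `Send` nor `Sync`, so the compiler enforces the single-thread restriction rather than leaving it to convention.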
This merge request implements these changes, which in preliminary testing seem to improve performance from around 80 to 85 % of raw libblkio to something like 92 %. cargo flamegraph (from a quick glance) seems to indicate that the remainder is spent on malloc()/free()[1] and general async overhead (e.g. dropping the Arc that is internal to tokio’s waker), so this seems to be about as well as we can reasonably do.
[1] Two uses: one comes from the RequestState object; the other comes from the bench program, which allocates space for the futures and so isn’t quite libblkio-async’s fault. Maybe that’s something that can be improved upon.