Performance

Benchmarking read accesses on tmpfs, regardless of the number of concurrent requests, libblkio-async achieves only around 80 to 85 % of the IOPS of using libblkio directly (i.e. https://gitlab.com/hreitz/libblkio-async-issue-5).

The reason for this is not quite clear to me yet, because the flamegraphs of the two look so vastly different: The io_uring_enter() syscall only makes up 40.22 % in the libblkio-async graph, but 97.16 % in the libblkio-async-issue-5 graph. That is a bit surprising considering the performance is not that much worse, but it could be explained by other operations simply taking up more time with libblkio-async, in which case the gap would widen if the underlying storage were faster.

However, the really interesting part is that libblkio-async-issue-5’s flame graph is basically just io_uring_enter() and nothing else, whereas the rest of the flame graph for libblkio-async has the following to offer:

  • 49.02 % is in blkio’s do_io(), i.e. 8.80 % of the whole runtime is spent in do_io() but not in io_uring_enter() (why is this different from pure libblkio, where the difference is just 1.34 %?)
  • Ostensibly 3.90 % is spent on Mutex::try_lock() (in process_completions()), which makes no sense whatsoever, given that this is supposed to be non-blocking and just a couple of harmless atomic accesses
  • blkio::Blkioq::read() takes 6.53 %, while pure libblkio reports only 0.89 %. Why?

Things that really are unique to libblkio-async, but are still strange:

  • process_completions() dropping an Arc<RequestState>: 1.47 % (should just be atomic_sub())

Other things, which are unique to libblkio-async, and most of the time still don’t make sense:

  • process_completions():
    • Dropping the MutexGuard on &mut Blkioq takes process_completions() 0.67 %
  • poll():
    • Cloning a waker (tokio::util::wake::clone_arc_raw()): 3.22 % (basically just Arc::clone(), i.e. atomic_add())
    • Dropping a tokio waker: 2.07 %
    • Dropping the MutexGuard on a waker: 4.76 %
    • Locking a mutex (presumably on a waker): 4.53 %
  • new_request():
    • Just cloning the just-created Arc<RequestState>: 3.40 %
    • Allocating the Arc<RequestState>: 3.03 %, 2.00 % spent in malloc() (this is the one that actually seems reasonable)
    • Cloning Arc<Queue>: 1.03 %
  • read() (which calls new_request()):
    • Locking the Mutex<&mut Blkioq>: 1.86 % (should never be contended, because there is no multithreading)
    • Dropping that guard: 1.19 %
  • Dropping the Arc<RequestState> completely: 6.38 % (might be reasonable, but seems much), only 1.62 % spent on cfree()

So all in all, the flame graph makes no sense in a lot of places. There is a malloc() here and a cfree() there that might add up to 4 %, but not to 15 to 20 %.
