Performance
Comparing performance on tmpfs read accesses, independently of the number of concurrent requests, libblkio-async achieves only around 80 to 85 % of the IOPS of using libblkio directly (i.e. https://gitlab.com/hreitz/libblkio-async-issue-5).
The reason for this is not quite clear to me yet, because the two flame graphs look vastly different: the `io_uring_enter()` syscall makes up only 40.22 % in the libblkio-async graph, but 97.16 % in the libblkio-async-issue-5 graph. That is a bit surprising considering the performance is not that much worse, but it could be explained by other operations simply taking up more time with libblkio-async, in which case the performance would be that much worse if only the underlying storage were faster.
However, the really interesting part is that libblkio-async-issue-5’s flame graph is basically just `io_uring_enter()` and nothing else, whereas the rest of the flame graph for libblkio-async has the following to offer:
- 49.02 % is in blkio’s `do_io()`, i.e. 8.80 % of the whole runtime is spent in `do_io()` but not `io_uring_enter()` (why is this different from pure libblkio, where the difference is just 1.34 %?)
- Ostensibly 3.90 % is spent on `Mutex::try_lock()` (in `process_completions()`), which makes no sense whatsoever, given that this is supposed to be non-blocking and just a couple of harmless atomic accesses
- `blkio::Blkioq::read()` takes 6.53 %, while pure libblkio reports only 0.89 %. Why?
Things that really are unique to libblkio-async, but are still strange:
- `process_completions()` dropping an `Arc<RequestState>`: 1.47 % (should just be `atomic_sub()`)
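For reference, dropping a non-final `Arc` clone is indeed just an atomic refcount decrement; the pointee’s destructor (and the free) only happens on the last drop. A small sketch, with a hypothetical `RequestState` stand-in rather than libblkio-async’s real type:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for libblkio-async's RequestState.
struct RequestState;

impl Drop for RequestState {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (strong count with a live clone,
///          destructor runs after dropping the clone,
///          destructor runs after dropping the last Arc).
fn demo() -> (usize, usize, usize) {
    DROPS.store(0, Ordering::SeqCst);
    let state = Arc::new(RequestState);
    let clone = Arc::clone(&state);
    let count = Arc::strong_count(&state); // 2

    // Dropping a non-final clone is only an atomic_sub() on the
    // refcount; RequestState's destructor does not run.
    drop(clone);
    let after_clone = DROPS.load(Ordering::SeqCst); // 0

    // The last drop is the one that runs Drop and frees the memory.
    drop(state);
    let after_last = DROPS.load(Ordering::SeqCst); // 1

    (count, after_clone, after_last)
}

fn main() {
    assert_eq!(demo(), (2, 0, 1));
}
```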
Other things that are unique to libblkio-async and that most of the time still don’t make sense:
- `process_completions()`:
  - Dropping the `MutexGuard` on `&mut Blkioq` takes 0.67 %
- `poll()`:
  - Cloning a waker (`tokio::util::wake::clone_arc_raw()`): 3.22 % (basically just `Arc::clone()`, i.e. `atomic_add()`)
  - Dropping a tokio waker: 2.07 %
  - Dropping the `MutexGuard` on a waker: 4.76 %
  - Locking a mutex (presumably on a waker): 4.53 %
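The waker numbers are measured against similarly trivial operations: a waker backed by an `Arc` clones and drops by bumping that refcount. A std-only sketch using `std::task::Wake` instead of tokio’s internal waker machinery (so it maps only approximately, but the vtable’s clone path is the same `Arc::clone()`):

```rust
use std::sync::Arc;
use std::task::{Wake, Waker};

// Minimal no-op waker, standing in for tokio's task waker.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Tracks the Arc's strong count as the Waker is cloned and dropped.
fn strong_counts() -> (usize, usize, usize, usize) {
    let inner = Arc::new(NoopWaker);

    // Waker::from() takes over one Arc reference...
    let waker = Waker::from(Arc::clone(&inner));
    let with_waker = Arc::strong_count(&inner); // 2

    // ...and Waker::clone() goes through clone_arc_raw(), i.e. one
    // atomic refcount increment.
    let waker2 = waker.clone();
    let with_clone = Arc::strong_count(&inner); // 3

    // Dropping a waker is the matching atomic decrement.
    drop(waker2);
    let after_drop = Arc::strong_count(&inner); // 2
    drop(waker);
    let after_all = Arc::strong_count(&inner); // 1

    (with_waker, with_clone, after_drop, after_all)
}

fn main() {
    assert_eq!(strong_counts(), (2, 3, 2, 1));
}
```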
- `new_request()`:
  - Just cloning the just-created `Arc<RequestState>`: 3.40 %
  - Allocating the `Arc<RequestState>`: 3.03 %, 2.00 % of which spent in `malloc()` (this is the one that actually seems reasonable)
  - Cloning the `Arc<Queue>`: 1.03 %
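The contrast between the allocation (plausible) and the clone (not) can be made concrete with a counting allocator: `Arc::new()` performs exactly one heap allocation, while `Arc::clone()` performs none. A std-only sketch, again with a hypothetical `RequestState` stand-in:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

// System allocator wrapper that counts every malloc()-style call.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Hypothetical stand-in for libblkio-async's RequestState.
struct RequestState {
    _buf: [u8; 64],
}

/// Returns (allocations after Arc::new(), allocations after the clone).
fn alloc_counts() -> (usize, usize) {
    let before = ALLOCS.load(Ordering::Relaxed);

    // Arc::new() makes exactly one heap allocation (the ArcInner),
    // so time under malloc() here is expected.
    let state = Arc::new(RequestState { _buf: [0; 64] });
    let after_new = ALLOCS.load(Ordering::Relaxed) - before;

    // Arc::clone() allocates nothing at all: it is one atomic
    // refcount increment, which is why 3.40 % for the clone is the
    // number that looks odd.
    let clone = Arc::clone(&state);
    let after_clone = ALLOCS.load(Ordering::Relaxed) - before;

    drop(clone);
    drop(state);
    (after_new, after_clone)
}

fn main() {
    assert_eq!(alloc_counts(), (1, 1));
}
```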
- `read()` (which calls `new_request()`):
  - Locking the `Mutex<&mut Blkioq>`: 1.86 % (should never be contended, because there is no multithreading)
  - Dropping that guard: 1.19 %
- Dropping the `Arc<RequestState>` completely: 6.38 % (might be reasonable, but seems like a lot), with only 1.62 % spent in `cfree()`
So all in all, the flame graph makes no sense in a lot of places. There is a `malloc()` here and a `cfree()` there that might make up 4 %, but not 15 to 20 %.