Performance optimizations for the file writer
legacy references: CDIFHD-146
Changes to the product
There are currently several copies happening in the file writer, and on top of that even a fresh heap allocation.
From the projects, we have a requirement of roughly 10 GB/s at reasonably low CPU utilization. In the current state, at full CPU utilization on the critical path, a process-wide limit of barely 2 GB/s can be reached.
- Each sample is first serialized into a buffer. **There should be no buffer on the serializer level, but a straight pass-through to the `File` object.** This buffer is additionally allocated fresh on the heap, placing additional stress on the allocator. No fresh heap allocations in this path! (This introduces hard process-wide contention!)
- Due to the use of buffered (unaligned) writes, another copy happens in the kernel into memory-mapped pages.
- Due to performing synchronous writes, kernel-side memory management blocks the IO syscalls from returning for a long time (with far less than 100% active time of the backing drive).
- Many small writes hit the kernel without user-space buffering for mixed loads.
Benefit for customer
Store data rates of ~100 Gb/s.
Additional information
- Remove the buffer from the `Serializer`. Replace it by a pass-through wrapper for `File`.
- Open the output in exclusive, unbuffered mode.
- Use a queue of at least 8 aligned buffers inside `File`, which are written unbuffered and asynchronously. A reasonable size per buffer is in the range of 512 kB to 1 MB.
- Serialize directly into these buffers. This is the only copy operation allowed.
- Be aware of how to write the last chunk prior to closing the file: write the full buffer (including unused padding bytes) first, then truncate the file to get rid of the trailing padding bytes.
- Make sure to resize the file only rarely. (Consider reserving in 512 MB or even 1 GB steps; resizing will stall registered file IO processing!)
- There is no seek support or the like when going this route, but none is needed to write the only format we care about.
- For Windows, use `FILE_FLAG_NO_BUFFERING` and `FILE_FLAG_OVERLAPPED`.
  - Use `SetFileIoOverlappedRange` to avoid any overhead for memory mapping. There is no file handle registration though, so file system overhead is unavoidable. `SetFileIoOverlappedRange` needs to be applied to both the buffer and the `OVERLAPPED` struct!
- (optional) For Windows 11 and later, use the IoRing API.
  - Use `BuildIoRingRegisterBuffers` and `BuildIoRingRegisterFileHandles` to avoid any overhead from memory mapping or the file system. Be aware that for resizing, the file has to be re-registered.
- For Linux, use `io_uring` if available, also with explicitly unbuffered IO (`O_DIRECT`). Ubuntu 18.04 has kernel-side `io_uring` support (kernel 5.4, backported in 2020 for hardware support), but `liburing` needs to be backported manually (available via conan).
  - When possible, use `io_uring_register_buffers` and `io_uring_register_files` to avoid any overhead from memory mapping or the file system. Be aware that for resizing, the file has to be re-registered.
Due to the ADTF file format not being sector-aligned, at least a single copy to ensure alignment is unavoidable. However, without backing this by a comparatively large queue of async writes, a buffer alone would be no better than letting the kernel do buffered writes instead. The latency of unbuffered writes needs to be compensated!
In case an extension of the ADTF file format is viable, consider splitting the huge data chunks into a separate file (not interleaved within the index) which can properly be written to and read from with 512 kB super-alignment.
Letting the kernel do the buffering is not an option; syscalls are way too expensive. Serializing in user space was the correct idea, just not properly integrated. If we had uring support on all platforms, we could afford this, since the syscall overhead would be gone from the equation; but we don't, and it would be both more complicated and less efficient than unbuffered IO.