Performance optimizations for the file writer
legacy references: CDIFHD-146
Changes to the product
There are currently several copies happening in the file writer, and on top of that even a fresh heap allocation.
From the projects, we have a requirement of roughly 10 GB/s at reasonably low CPU utilization. In the current state, at full CPU utilization on the critical path, a process-wide limit of barely 2 GB/s can be reached.
- Each sample is first serialized into a buffer. **There should be no buffer on the serializer level, but a straight pass-through to the `File` object.** This buffer is additionally allocated fresh on the heap, placing additional stress on the allocator. No fresh heap allocations in this path! (This introduces hard process-wide contention!)
- Due to the use of buffered (unaligned) writes, another copy happens in the kernel into memory-mapped pages.
- Due to performing synchronous writes, kernel-side memory management blocks the IO syscalls from returning for a long time (with far less than 100% active time of the backing drive).
- Many small writes hit the kernel without user-space buffering for mixed loads.
Benefit for customer
Store data rates of ~100 Gb/s.
Additional information
- Remove the buffer from the `Serializer`. Replace it by a pass-through wrapper for `File`.
- Open the output in exclusive, unbuffered mode.
- Use a queue of at least 8 aligned buffers inside `File`, which are written unbuffered and asynchronously. A reasonable size per buffer is in the range of 512 kB to 1 MB.
- Serialize directly into these buffers. This is the only copy operation allowed.
- Be aware of how to write the last chunk prior to closing the file: write the full buffer (including unused padding bytes) first, then truncate the file to get rid of the trailing padding bytes.
- Make sure to resize the file only rarely. (Consider reserving in 512 MB or even 1 GB steps; resizing will stall registered file IO processing!)
- There is no seek support or the like when going this route, but none is needed to write the only format we care about.
- For Windows, use `FILE_FLAG_NO_BUFFERING` and `FILE_FLAG_OVERLAPPED`.
  - Use `SetFileIoOverlappedRange` to avoid any overhead for memory mapping. There is no file handle registration though, so file system overhead is unavoidable. `SetFileIoOverlappedRange` needs to be applied to both the buffer and the `OVERLAPPED` struct!
- (optional) For Windows 11 and later, use the IoRing API.
  - Use `BuildIoRingRegisterBuffers` and `BuildIoRingRegisterFileHandles` to avoid any overhead from memory mapping or the file system. Be aware that for resizing, the file has to be re-registered.
- For Linux, use `io_uring` if available, also with explicitly unbuffered IO (`O_DIRECT`). Ubuntu 18.04 has kernel-side `io_uring` support (kernel 5.4, backported in 2020 for hardware support), but `liburing` needs to be backported manually (available via conan).
  - When possible, use `io_uring_register_buffers` and `io_uring_register_files` to avoid any overhead from memory mapping or the file system. Be aware that for resizing, the file has to be re-registered.
Due to the ADTF file format not being sector-aligned, at least a single copy to ensure alignment is unavoidable. However, without backing this by a comparatively large queue of async writes, a buffer alone would be no better than letting the kernel do buffered writes instead. The latency of unbuffered writes needs to be compensated!
In case an extension of the ADTF file format is viable, consider splitting the huge data chunks into a separate file (not interleaved within the index) which can properly be written to and read from with 512 kB super-alignment.
Letting the kernel do the buffering is not an option; syscalls are way too expensive. Serializing in user space was the correct idea, just not properly integrated. If we had uring support on all platforms, we could afford this, since the syscall overhead would be gone from the equation; but we don't, and it would be both more complicated and less efficient than unbuffered IO.