Proposal: Use multiple fixed-size .db files to avoid remapping

Problem

The majority of the most significant bugs we've encountered in this library arise from the need to resize the '.db' files as metric data grows. Resizing requires us to keep a WeakMap of every RString returned to Ruby and to reach into the details of Ruby's implementation to update their pointers when we re-mmap the file. There are likely additional bugs in this process that we have yet to encounter.

Proposal

As an alternative, I propose allowing the native extension to create multiple '.db' files.

A setting will be added for the maximum size of '.db' files, and new files will be truncated to that size upon creation. The caller may optionally set the maximum size at runtime; it must be a multiple of the system page size. The default size will be 4 MiB, which is a multiple of the common page sizes: 4 KiB (x86_64), 16 KiB (Apple aarch64), and 2 MiB (transparent huge pages).
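As a rough illustration of the size rule, the sketch below validates a candidate maximum size and creates a file truncated to it. It is written in Python only to keep it short; the names `check_max_file_size` and `create_db_file` are hypothetical and are not part of this library.

```python
import mmap

DEFAULT_MAX_FILE_SIZE = 4 * 1024 * 1024  # 4 MiB default from this proposal


def check_max_file_size(size, page_size=mmap.PAGESIZE):
    """Return `size` if it is a positive multiple of `page_size`, else raise."""
    if size <= 0 or size % page_size != 0:
        raise ValueError(
            f"max file size {size} is not a positive multiple of page size {page_size}"
        )
    return size


def create_db_file(path, max_size=DEFAULT_MAX_FILE_SIZE):
    """Create `path`, extend it to `max_size` bytes, and map the whole file."""
    check_max_file_size(max_size)
    with open(path, "w+b") as f:
        f.truncate(max_size)  # the file is sized once, at creation
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), max_size)
```

Because the file never grows after creation, the mapping returned here never has to be replaced.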

When a file has been filled, a new file is created and mmapped, with its starting position recorded in its header. The existing files will remain mapped and retain their existing addresses. Files will have their index appended to their name, e.g. counter_worker_id_0-0_0.db, counter_worker_id_0-0_1.db, counter_worker_id_0-0_2.db.
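The naming scheme and the "file is full" check could look roughly like the following. This is a sketch under two assumptions that the proposal does not settle: `used` counts bytes from the start of the file (header included), and an entry is never split across two files. Both function names are hypothetical.

```python
def db_file_name(base, index):
    """Append the file index to the base name, e.g. base_0.db, base_1.db."""
    return f"{base}_{index}.db"


def needs_new_file(used, entry_len, max_size):
    """True when an entry of `entry_len` bytes no longer fits in the current file.

    Assumes `used` is measured from the start of the file and that entries
    are not split across files.
    """
    return used + entry_len > max_size
```

For example, `db_file_name("counter_worker_id_0-0", 2)` yields `counter_worker_id_0-0_2.db`, matching the names above.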

Ruby will be unaware of which file it is accessing; the native extension will map the position provided to the appropriate file internally.
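One way the position lookup might work is a binary search over each file's Starting Position. The sketch below is illustrative Python, not the extension's code. It assumes that positions handed to Ruby count payload bytes only, so the 24-byte header (six 4-byte fields, per the header format in this proposal) is skipped when computing the offset inside a file. `locate` is a hypothetical name.

```python
from bisect import bisect_right

HEADER_SIZE = 24  # six 4-byte header fields, per the format in this proposal


def locate(starts, position):
    """Return (file_index, byte offset inside that file) for an absolute position.

    `starts` is the sorted list of every file's Starting Position.
    Assumption: positions count payload bytes only, so the header is skipped.
    No upper bound is checked here; a real implementation would also compare
    against the last file's Used field.
    """
    if not starts or position < starts[0]:
        raise ValueError(f"position {position} precedes the first file")
    index = bisect_right(starts, position) - 1  # last file starting at or before position
    return index, HEADER_SIZE + (position - starts[index])
```

With 4 MiB files, each file would hold 4194280 payload bytes under this assumption, so `locate([0, 4194280], 4194280)` resolves to the first payload byte of the second file.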

A new header format is needed to distinguish fixed-size files:

Magic number: 4 bytes, 0x4d4d4150 aka "MMAP", big-endian.
Version: 4 bytes, native-endian.
Starting Position: 4 bytes, native-endian. The absolute offset this file starts at.
Max File Size: 4 bytes, native-endian. The maximum size in bytes allowed for this file.
Used: 4 bytes, native-endian. The size of data written to the file in bytes.
Padding: 4 NUL bytes.

The existing object format continues from there.

protocol "MMAP:4,Version:4,Start:4,MaxSz:4,Used:4,Pad:4,K1 Size:4,K1 Name:4,K1 Value:8,K2 Size:4,K2 Name:4,K2 Value:8"
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  MMAP |Version| Start | MaxSz |  Used |  Pad  |K1 Size|K1 Name|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    K1 Value   |K2 Size|K2 Name|    K2 Value   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
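To make the byte layout concrete, here is a minimal Python sketch that packs and unpacks the 24-byte header described above: a big-endian magic number followed by four native-endian 32-bit fields and 4 NUL padding bytes. The function names and the version value in the example are hypothetical, and the fields are assumed to be unsigned.

```python
import struct

MAGIC = struct.pack(">I", 0x4D4D4150)  # big-endian, spells b"MMAP"
FIELDS = "=4I4x"  # native-endian: version, start, max size, used, then 4 NUL bytes
HEADER_SIZE = len(MAGIC) + struct.calcsize(FIELDS)  # 24 bytes


def pack_header(version, start, max_size, used):
    """Serialize the header fields in the order given in this proposal."""
    return MAGIC + struct.pack(FIELDS, version, start, max_size, used)


def unpack_header(buf):
    """Return (version, start, max_size, used), rejecting a bad magic number."""
    if len(buf) < HEADER_SIZE or buf[:4] != MAGIC:
        raise ValueError("not a fixed-size .db file header")
    return struct.unpack(FIELDS, buf[4:HEADER_SIZE])
```

A reader can use the magic number check to tell fixed-size files apart from files in the existing format before interpreting the remaining fields.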

Advantages

The address of each RString returned to Ruby by the slice method will be static for the lifetime of the process. We will no longer need to track or update the strings we create, reducing internal complexity and removing the need for this library to be aware of the internal details of Ruby's RString implementation.

Downsides

The parser currently takes a slice of the full mmapped region, which isn't possible if we split the files. An alternative here is to move the actual unpacking into the native extension, where we can handle moving between files.

Initial memory usage will be significantly higher with this approach and will grow in large chunks as each additional file is added, rather than gradually.

This will require some other adjustments to the Ruby code where files are touched directly, such as when initializing files. Setting flocks may need to move to the native extension.

Edited Dec 12, 2023 by Will Chandler (ex-GitLab)