Content-defined chunking of artifacts and caches
Our zip artifacts and caches can be slow to work with:
- It's difficult to add files to an archive concurrently.
  - Fastzip can do it, but it requires staging compressed files to disk before writing them back to the archive.
- Zip is good for random access to specific files, but bad for random access to specific locations within a compressed file.
- GitLab Pages serves directly from artifacts. It supports HTTP range requests into files, but only when they're stored uncompressed; serving ranges of compressed files can be quite costly.
- We always upload and download everything:
  - If the client already has most of the data for the archive, we still download, decompress and overwrite everything.
  - If the server already has most of the data for the archive, we still compress and upload everything.
The archives we create have a very similar usage pattern to incremental backups: they happen frequently, and it's uncommon for much data to change from one build to the next.
I'm exploring content-defined chunking, similar to how backup programs work, to produce archives that are faster to work with, smaller, and that reduce data transfer.
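To make the idea concrete before digging into the data, here's a minimal sketch of content-defined chunking plus content-addressing in Go. It assumes the third-party `github.com/jotfs/fastcdc-go` package (one of several FastCDC implementations; the `Options` fields and `Chunk` struct below follow my reading of its API), and `artifact.bin` is a hypothetical input file:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"

	fastcdc "github.com/jotfs/fastcdc-go"
)

func main() {
	f, err := os.Open("artifact.bin") // hypothetical input
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Target ~64KB chunks. Boundaries are chosen by content, not
	// position, so unchanged runs of data produce identical chunks
	// even when bytes are inserted or removed elsewhere in the file.
	chunker, err := fastcdc.NewChunker(f, fastcdc.Options{
		MinSize:     16 * 1024,
		AverageSize: 64 * 1024,
		MaxSize:     256 * 1024,
	})
	if err != nil {
		panic(err)
	}

	for {
		chunk, err := chunker.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// The SHA256 of the chunk is its content address: two
		// archives that share this chunk only need to store and
		// transfer it once.
		addr := sha256.Sum256(chunk.Data)
		fmt.Printf("%x  %d bytes\n", addr, chunk.Length)
	}
}
```

Deduplication then falls out of keeping a set of already-seen addresses and skipping any chunk whose address is in it.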
On the 15th of April, I downloaded all the artifacts for 15 GitLab repository pipelines that ran over the course of an hour.
pipeline | archives | files | compressed | uncompressed | metadata |
---|---|---|---|---|---|
286562328 | 111 | 67690 | 702.24 MiB | 4.61 GiB | 21.26 MiB |
286562519 | 65 | 61211 | 644.21 MiB | 3.59 GiB | 19.75 MiB |
286566208 | 110 | 67687 | 702.50 MiB | 4.63 GiB | 21.26 MiB |
286567956 | 65 | 61273 | 644.11 MiB | 3.60 GiB | 19.77 MiB |
286567980 | 113 | 67730 | 708.36 MiB | 4.77 GiB | 21.24 MiB |
286574466 | 110 | 67341 | 699.80 MiB | 4.63 GiB | 21.14 MiB |
286574752 | 106 | 67502 | 656.41 MiB | 3.58 GiB | 21.20 MiB |
286575405 | 110 | 67666 | 701.46 MiB | 4.61 GiB | 21.25 MiB |
286578721 | 106 | 67521 | 657.62 MiB | 3.59 GiB | 21.21 MiB |
286578837 | 110 | 67671 | 701.97 MiB | 4.62 GiB | 21.25 MiB |
286585101 | 12 | 12189 | 482.01 MiB | 1.10 GiB | 3.74 MiB |
286588429 | 103 | 61236 | 491.91 MiB | 3.19 GiB | 19.20 MiB |
286589205 | 58 | 57959 | 644.08 MiB | 3.24 GiB | 18.65 MiB |
286590672 | 117 | 68065 | 712.22 MiB | 4.80 GiB | 21.34 MiB |
286603708 | 103 | 64383 | 701.52 MiB | 4.26 GiB | 20.14 MiB |
Ignoring the metadata (everything in a zip archive that isn't compressed data for a file), the total size of all archives is 9.62 GiB, across 927124 compressed files.
However, only 466321 of those files are unique, amounting to 5.07 GiB compressed (45.05 GiB uncompressed). Everything else is a duplicate.
The table below shows these results, alongside deduplication using FastCDC on content-defined chunks (at both ~8KB and ~64KB average chunk sizes). To see how compression interacts with deduplication, I've also compressed the files/chunks with zstd.
type | compression | unique data |
---|---|---|
unique files | deflate | 466321 files, 5.07 GiB (45.05 GiB uncompressed) |
unique files | zstd | 466321 files, 4.86 GiB (45.05 GiB uncompressed) |
unique ~8KB chunks | deflate | 2110220 chunks, 2.80 GiB (20.84 GiB uncompressed) |
unique ~8KB chunks | zstd | 2110220 chunks, 2.97 GiB (20.84 GiB uncompressed) |
unique ~64KB chunks | deflate | 743673 chunks, 2.81 GiB (31.13 GiB uncompressed) |
unique ~64KB chunks | zstd | 743673 chunks, 3.01 GiB (31.13 GiB uncompressed) |
Result summary:
- We transferred and stored a lot of duplicate data for these pipelines. I believe some of these artifacts are stored for a month, and all of this data was created within the space of an hour, so I suspect the cost adds up. And this is only artifacts, not cached data.
- Deduplicating on content-defined chunks and storing only the unique chunks works far better than deduplicating solely at file boundaries.
- The relationship between deduplication and compression can be non-obvious:
  - When compressing whole files, `zstd` is a clear winner in comparison to `deflate`: `zstd` has a larger window size.
  - When compressing smaller chunks, the compression ratios suffer. But based on how much we've deduplicated (which arguably is a form of compression), we're better off.
  - It's worth noting though that whilst `deflate` provided a better compression ratio at its default compression level, `zstd` at its default compression level compressed the data twice as fast (the "defaults" refer to those in the Go packages; a sketch of this comparison follows this list).
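For reference, below is a minimal sketch of that kind of measurement, assuming the standard library's `compress/flate` for deflate and the third-party `github.com/klauspost/compress/zstd` package, with both compressors at their default levels; the sample input is made up:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

// compressedSizes returns the number of bytes deflate and zstd produce
// for the same input at their default compression levels.
func compressedSizes(data []byte) (deflateLen, zstdLen int, err error) {
	var db bytes.Buffer
	fw, err := flate.NewWriter(&db, flate.DefaultCompression)
	if err != nil {
		return 0, 0, err
	}
	if _, err = fw.Write(data); err != nil {
		return 0, 0, err
	}
	if err = fw.Close(); err != nil {
		return 0, 0, err
	}

	var zb bytes.Buffer
	zw, err := zstd.NewWriter(&zb)
	if err != nil {
		return 0, 0, err
	}
	if _, err = zw.Write(data); err != nil {
		return 0, 0, err
	}
	if err = zw.Close(); err != nil {
		return 0, 0, err
	}

	return db.Len(), zb.Len(), nil
}

func main() {
	// Made-up sample input; repetitive data compresses well.
	data := bytes.Repeat([]byte("some example artifact data "), 10000)
	d, z, err := compressedSizes(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("input %d bytes, deflate %d bytes, zstd %d bytes\n", len(data), d, z)
}
```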
The results above were created by chunking each file individually. This means that if a file is below 64KB, a chunk below 64KB will be produced. This isn't ideal, as there are certain overheads the more chunks you produce:
- Chunks need to be content-addressable. This makes it easier to determine the difference between two archives and negotiate which objects the client and server each need to download or upload. If each chunk has a SHA256 address, a 1GB file chunked at 8KB requires 131072 hashes (4 MiB of addresses); at 64KB, only 16384 hashes (0.5 MiB). The sketch after this list works through this arithmetic.
- The fewer the chunks, the less metadata is associated with each archive.
- If tracking these addresses in a database, this could translate to 131072 vs. 16384 inserted rows.
- If using object storage for each chunk, an archive could require many requests to the object store to be fully downloaded. The latency associated with this can ruin all other gains from chunking.
- If files are smaller than your minimum chunk size, the chunks produced from them will be too, and compression ratios will also suffer.
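To make the address-overhead point concrete, here's the arithmetic from the first bullet as a small self-contained sketch:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkOverhead returns how many chunks a file of fileSize bytes splits
// into at a given chunk size, and how many bytes of SHA256 addresses
// are needed to reference them all.
func chunkOverhead(fileSize, chunkSize int64) (chunks, hashBytes int64) {
	chunks = (fileSize + chunkSize - 1) / chunkSize // round up
	return chunks, chunks * sha256.Size             // 32 bytes per address
}

func main() {
	const fileSize = 1 << 30 // the 1GB file from the example above
	for _, chunkSize := range []int64{8 << 10, 64 << 10} {
		chunks, hashBytes := chunkOverhead(fileSize, chunkSize)
		fmt.Printf("%2dKB chunks: %6d hashes, %.2f MiB of addresses\n",
			chunkSize>>10, chunks, float64(hashBytes)/(1<<20))
	}
}
```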
Backup programs use various techniques to try and overcome these overheads:
- Restic:
  - Uses much larger chunks (512KB to 2MiB).
  - Chunks on file boundaries, but then packs small chunks into packfiles for object storage.
  - Mostly does this because clients connect to object storage directly, and fetching many small chunks over a high-latency connection is an awful experience.
- Tarsnap:
  - Doesn't chunk on file boundaries; instead it produces what is effectively a tar archive and chunks that continuous stream of data. It's worth noting that this can be considerably slower, as you cannot chunk files concurrently.
  - Aims to chunk at 64KB, but uses a server close to object storage to solve the latency problem: clients batch-request the objects they require and Tarsnap streams them from a server located close to the object storage.
My test implementation is slightly different from both Tarsnap and Restic, but borrows from both:
- I plan on there being a server for object negotiation located close to object storage, which means I can use smaller chunks (64KB).
- I chunk at file boundaries so that I can process multiple files concurrently, but then pack the chunks together deterministically (a rough sketch of this follows the list).
- I can throw away the addresses for each chunk, and use only the address for the packfile. This results in less metadata and fewer objects.
- Packfiles are closer to the target chunk size, which should improve the compression ratio.
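As a rough sketch of that packing step (hypothetical names throughout; this is not the actual implementation): sort chunks by their content address so that the same set of chunks always produces byte-identical packfiles, concatenate them up to a target size, and address each packfile by the SHA256 of its contents.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// A chunk as produced by per-file content-defined chunking.
type chunk struct {
	addr [sha256.Size]byte // SHA256 content address
	data []byte
}

// packChunks groups chunks into packfiles of roughly targetSize bytes.
// Sorting by address first makes the output deterministic: the same
// set of chunks yields byte-identical packfiles regardless of the
// order in which files were chunked.
func packChunks(chunks []chunk, targetSize int) [][]byte {
	sort.Slice(chunks, func(i, j int) bool {
		return bytes.Compare(chunks[i].addr[:], chunks[j].addr[:]) < 0
	})

	var packs [][]byte
	var cur bytes.Buffer
	for _, c := range chunks {
		cur.Write(c.data)
		if cur.Len() >= targetSize {
			packs = append(packs, append([]byte(nil), cur.Bytes()...))
			cur.Reset()
		}
	}
	if cur.Len() > 0 {
		packs = append(packs, append([]byte(nil), cur.Bytes()...))
	}
	return packs
}

func main() {
	// Hypothetical chunks as they might come out of per-file chunking.
	var chunks []chunk
	for _, d := range [][]byte{[]byte("chunk from file A"), []byte("chunk from file B")} {
		chunks = append(chunks, chunk{addr: sha256.Sum256(d), data: d})
	}

	for _, p := range packChunks(chunks, 64*1024) {
		// Only the packfile's address is kept; per-chunk addresses
		// can be thrown away once packing is done.
		fmt.Printf("pack %x (%d bytes)\n", sha256.Sum256(p), len(p))
	}
}
```

Sorting by address is just one simple way to get determinism regardless of the order files were chunked in; the real scheme could group chunks differently, as long as the grouping is stable.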
To be continued...