Content-defined chunking of artifacts and caches
Our zip artifacts and caches can be slow to work with:
- It's difficult to add files to an archive concurrently.
  - Fastzip can do it, but it requires staging compressed files to disk before writing them back to the archive.
- Zip is good for random access to specific files, but bad for random access to specific locations within a compressed file.
- GitLab Pages serves directly from artifacts. It supports HTTP range requests into files, but only when they're stored uncompressed; serving ranges of compressed files can be quite costly.
- We always upload and download everything:
  - If the client already has most of the data for the archive, we still download, decompress and overwrite everything.
  - If the server already has most of the data for the archive, we still compress and upload everything.
The archives we create have a very similar usage pattern to incremental backups: they happen frequently, and it's uncommon for much data to change from one build to the next.
I'm exploring content-defined chunking, similar to how backup programs work, to produce archives that are faster to work with, smaller, and that reduce data transfer.
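To make the idea concrete before digging into the data, here's a minimal sketch of content-defined chunking plus content-addressing in Go. It assumes the third-party `github.com/jotfs/fastcdc-go` package (one of several FastCDC implementations; the `Options` fields and `Chunk` struct below follow my reading of its API), and `artifact.bin` is a hypothetical input file:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"

	fastcdc "github.com/jotfs/fastcdc-go"
)

func main() {
	f, err := os.Open("artifact.bin") // hypothetical input
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Target ~64KB chunks. Boundaries are chosen by content, not
	// position, so unchanged runs of data produce identical chunks
	// even when bytes are inserted or removed elsewhere in the file.
	chunker, err := fastcdc.NewChunker(f, fastcdc.Options{
		MinSize:     16 * 1024,
		AverageSize: 64 * 1024,
		MaxSize:     256 * 1024,
	})
	if err != nil {
		panic(err)
	}

	for {
		chunk, err := chunker.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// The SHA256 of the chunk is its content address: two
		// archives that share this chunk only need to store and
		// transfer it once.
		addr := sha256.Sum256(chunk.Data)
		fmt.Printf("%x  %d bytes\n", addr, chunk.Length)
	}
}
```

Deduplication then falls out of keeping a set of already-seen addresses and skipping any chunk whose address is in it.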
On the 15th of April, I downloaded all the artifacts for 15 GitLab repository pipelines that ran over the course of an hour.
pipeline | archives | files | compressed | uncompressed | metadata |
---|---|---|---|---|---|
286562328 | 111 | 67690 | 702.24 MiB | 4.61 GiB | 21.26 MiB |
286562519 | 65 | 61211 | 644.21 MiB | 3.59 GiB | 19.75 MiB |
286566208 | 110 | 67687 | 702.50 MiB | 4.63 GiB | 21.26 MiB |
286567956 | 65 | 61273 | 644.11 MiB | 3.60 GiB | 19.77 MiB |
286567980 | 113 | 67730 | 708.36 MiB | 4.77 GiB | 21.24 MiB |
286574466 | 110 | 67341 | 699.80 MiB | 4.63 GiB | 21.14 MiB |
286574752 | 106 | 67502 | 656.41 MiB | 3.58 GiB | 21.20 MiB |
286575405 | 110 | 67666 | 701.46 MiB | 4.61 GiB | 21.25 MiB |
286578721 | 106 | 67521 | 657.62 MiB | 3.59 GiB | 21.21 MiB |
286578837 | 110 | 67671 | 701.97 MiB | 4.62 GiB | 21.25 MiB |
286585101 | 12 | 12189 | 482.01 MiB | 1.10 GiB | 3.74 MiB |
286588429 | 103 | 61236 | 491.91 MiB | 3.19 GiB | 19.20 MiB |
286589205 | 58 | 57959 | 644.08 MiB | 3.24 GiB | 18.65 MiB |
286590672 | 117 | 68065 | 712.22 MiB | 4.80 GiB | 21.34 MiB |
286603708 | 103 | 64383 | 701.52 MiB | 4.26 GiB | 20.14 MiB |
Ignoring the metadata (everything in a zip archive that isn't compressed data for a file), the total size of all archives is 9.62 GiB, across 927124 compressed files.
However, only 466321 of those files are unique, amounting to 5.07 GiB compressed (45.05 GiB uncompressed). Everything else is a duplicate.
The table below shows these results, alongside deduplication using FastCDC on content-defined chunks (at both ~8KB and ~64KB average chunk sizes). To see how compression interacts with deduplication, I've also compressed the files/chunks with zstd.
type | compression | unique data |
---|---|---|
unique files | deflate | 466321 files, 5.07 GiB (45.05 GiB uncompressed) |
unique files | zstd | 466321 files, 4.86 GiB (45.05 GiB uncompressed) |
unique ~8KB chunks | deflate | 2110220 chunks, 2.80 GiB (20.84 GiB uncompressed) |
unique ~8KB chunks | zstd | 2110220 chunks, 2.97 GiB (20.84 GiB uncompressed) |
unique ~64KB chunks | deflate | 743673 chunks, 2.81 GiB (31.13 GiB uncompressed) |
unique ~64KB chunks | zstd | 743673 chunks, 3.01 GiB (31.13 GiB uncompressed) |
Result summary:
- We transferred and stored a lot of duplicate data for these pipelines. I believe some of these artifacts are stored for a month, and all of this data was created within the space of an hour, so I suspect the cost adds up. And this is only artifacts, not cached data.
- Deduplicating on content-defined chunks and storing only the unique chunks works far better than deduplicating solely at file boundaries.
- The relationship between deduplication and compression can be non-obvious:
  - When compressing whole files, `zstd` is a clear winner in comparison to `deflate`: `zstd` has a larger window size.
  - When compressing smaller chunks, the compression ratios suffer. But based on how much we've deduplicated (which arguably is a form of compression), we're better off.
  - It's worth noting though that whilst `deflate` provided a better compression ratio at its default compression level, `zstd` at its default compression level compressed the data twice as fast (the "defaults" refer to those in the Go packages; a sketch of this comparison follows this list).
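For reference, below is a minimal sketch of that kind of measurement, assuming the standard library's `compress/flate` for deflate and the third-party `github.com/klauspost/compress/zstd` package, with both compressors at their default levels; the sample input is made up:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

// compressedSizes returns the number of bytes deflate and zstd produce
// for the same input at their default compression levels.
func compressedSizes(data []byte) (deflateLen, zstdLen int, err error) {
	var db bytes.Buffer
	fw, err := flate.NewWriter(&db, flate.DefaultCompression)
	if err != nil {
		return 0, 0, err
	}
	if _, err = fw.Write(data); err != nil {
		return 0, 0, err
	}
	if err = fw.Close(); err != nil {
		return 0, 0, err
	}

	var zb bytes.Buffer
	zw, err := zstd.NewWriter(&zb)
	if err != nil {
		return 0, 0, err
	}
	if _, err = zw.Write(data); err != nil {
		return 0, 0, err
	}
	if err = zw.Close(); err != nil {
		return 0, 0, err
	}

	return db.Len(), zb.Len(), nil
}

func main() {
	// Made-up sample input; repetitive data compresses well.
	data := bytes.Repeat([]byte("some example artifact data "), 10000)
	d, z, err := compressedSizes(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("input %d bytes, deflate %d bytes, zstd %d bytes\n", len(data), d, z)
}
```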
The results above were created by chunking each file individually. This means that if a file is below 64KB, a chunk below 64KB will be produced. This isn't ideal, as there are certain overheads the more chunks you produce:
- Chunks need to be content-addressable. This makes it easier to determine the difference between two archives and negotiate which objects the client and server each need to download or upload. If each chunk has a SHA256 address, a 1GB file chunked at 8KB requires 131072 hashes (4 MiB of addresses); at 64KB, only 16384 hashes (0.5 MiB). The sketch after this list works through this arithmetic.
- The fewer the chunks, the less metadata is associated with each archive.
- If tracking these addresses in a database, this could translate to 131072 vs. 16384 inserted rows.
- If using object storage for each chunk, an archive could require many requests to the object store to be fully downloaded. The latency associated with this can ruin all other gains from chunking.
- If files are smaller than your minimum chunk size, the chunks produced from them will be too, and compression ratios will also suffer.
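To make the address-overhead point concrete, here's the arithmetic from the first bullet as a small self-contained sketch:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkOverhead returns how many chunks a file of fileSize bytes splits
// into at a given chunk size, and how many bytes of SHA256 addresses
// are needed to reference them all.
func chunkOverhead(fileSize, chunkSize int64) (chunks, hashBytes int64) {
	chunks = (fileSize + chunkSize - 1) / chunkSize // round up
	return chunks, chunks * sha256.Size             // 32 bytes per address
}

func main() {
	const fileSize = 1 << 30 // the 1GB file from the example above
	for _, chunkSize := range []int64{8 << 10, 64 << 10} {
		chunks, hashBytes := chunkOverhead(fileSize, chunkSize)
		fmt.Printf("%2dKB chunks: %6d hashes, %.2f MiB of addresses\n",
			chunkSize>>10, chunks, float64(hashBytes)/(1<<20))
	}
}
```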
Backup programs use various techniques to try and overcome these overheads:
- Restic:
  - Uses much larger chunks (512KB to 2MiB).
  - Chunks on file boundaries, but then packs small chunks into packfiles for object storage.
  - Mostly does this because clients connect to object storage directly, and fetching many small chunks over a high-latency connection is an awful experience.
- Tarsnap:
  - Doesn't chunk on file boundaries; instead it produces what is effectively a tar archive and chunks that continuous stream of data. It's worth noting that this can be considerably slower, as you cannot chunk files concurrently.
  - Aims to chunk at 64KB, but uses a server close to object storage to solve the latency problem: clients batch-request the objects they require and Tarsnap streams them from a server located close to the object storage.
My test implementation is slightly different from both Tarsnap and Restic, but borrows from both:
- I plan on there being a server for object negotiation located close to object storage, which means I can use smaller chunks (64KB).
- I chunk at file boundaries so that I can process multiple files concurrently, but then pack the chunks together deterministically (a rough sketch of this follows the list).
- I can throw away the addresses for each chunk, and use only the address for the packfile. This results in less metadata and fewer objects.
- Packfiles are closer to the target chunk size, which should improve the compression ratio.
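As a rough sketch of that packing step (hypothetical names throughout; this is not the actual implementation): sort chunks by their content address so that the same set of chunks always produces byte-identical packfiles, concatenate them up to a target size, and address each packfile by the SHA256 of its contents.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// A chunk as produced by per-file content-defined chunking.
type chunk struct {
	addr [sha256.Size]byte // SHA256 content address
	data []byte
}

// packChunks groups chunks into packfiles of roughly targetSize bytes.
// Sorting by address first makes the output deterministic: the same
// set of chunks yields byte-identical packfiles regardless of the
// order in which files were chunked.
func packChunks(chunks []chunk, targetSize int) [][]byte {
	sort.Slice(chunks, func(i, j int) bool {
		return bytes.Compare(chunks[i].addr[:], chunks[j].addr[:]) < 0
	})

	var packs [][]byte
	var cur bytes.Buffer
	for _, c := range chunks {
		cur.Write(c.data)
		if cur.Len() >= targetSize {
			packs = append(packs, append([]byte(nil), cur.Bytes()...))
			cur.Reset()
		}
	}
	if cur.Len() > 0 {
		packs = append(packs, append([]byte(nil), cur.Bytes()...))
	}
	return packs
}

func main() {
	// Hypothetical chunks as they might come out of per-file chunking.
	var chunks []chunk
	for _, d := range [][]byte{[]byte("chunk from file A"), []byte("chunk from file B")} {
		chunks = append(chunks, chunk{addr: sha256.Sum256(d), data: d})
	}

	for _, p := range packChunks(chunks, 64*1024) {
		// Only the packfile's address is kept; per-chunk addresses
		// can be thrown away once packing is done.
		fmt.Printf("pack %x (%d bytes)\n", sha256.Sum256(p), len(p))
	}
}
```

Sorting by address is just one simple way to get determinism regardless of the order files were chunked in; the real scheme could group chunks differently, as long as the grouping is stable.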
To be continued...