Replace zlib with zstd for compression in Git
Some old threads and proof-of-concept patches from Peff can be found here:
https://lore.kernel.org/git/20161023080552.lma2v6zxmyaiiqz5@sigill.intra.peff.net/
https://public-inbox.org/git/20160914235843.nacr54ekvl6rjipk@sigill.intra.peff.net/
To quote Peff:
So saving 10% here really _isn't_ that interesting. I mostly wanted to
confirm that we could use zstd without increasing the CPU time used for
deflating, so that we could reap the benefits on the inflate side. Which
is definitely the case. With these numbers, there's basically no
downside at all to using zstd. It's just faster to read the objects
later.
If we were designing git today, it seems like a no-brainer to use zstd
over zlib. But given backwards-compatibility issues, I'm not sure.
10-20% speedup on reading is awfully nice, but I don't think there's a
good way to gracefully transition, because zlib is part of the
on-the-wire format for serving objects. We could re-compress on the fly,
but that gets expensive (in existing cases, we can quite often serve the
zlib content straight from disk, but this would require an extra
inflate/deflate. At least we wouldn't have to reconstitute objects from
deltas, though).
A transition would probably look something like:
0. The patch below, or something like it, to teach git to read both
   zlib and zstd, and optionally write zstd. We'd probably want to
   make this an unconditional requirement like zlib, because the point
   is for it to be available everywhere (I assume the zstd code is
   pretty portable, but I haven't put it to the test).

1. Another patch to add a "zstd" capability to the protocol. This
   would require teaching pack-objects an option to convert zstd back
   to zlib on the fly.

   Servers which handle a limited number of updated clients can switch
   to zstd immediately to get the benefit, and their clients can
   handle it directly. Likewise, clients of older servers may wish to
   repack locally using zstd to get the benefit. They'll have to
   recompress on the fly during push, but pushes are rarer than other
   operations (and often limited by bandwidth anyway).

2. After a while, eventually flip the default to zstd=5.

3. If "a while" is long enough, perhaps add a patch to let servers
   tell clients "go upgrade" rather than recompressing on the fly.
I don't have immediate plans for any of that, but maybe something to
think about.
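To make step 0 of the quoted plan more concrete, here is a minimal sketch of how a reader might tell the two stream formats apart by their leading bytes. It is not taken from Peff's patch: `detect_stream_format` and the `stream_format` enum are hypothetical names, and a real implementation might record the algorithm explicitly instead of sniffing it. The sketch relies only on the published headers: a zstd frame begins with the magic number 0xFD2FB528 stored little-endian, and a zlib stream begins with a CMF/FLG byte pair whose 16-bit value is a multiple of 31.

```c
#include <stddef.h>

/* Hypothetical names for illustration; none of this exists in Git. */
enum stream_format { FORMAT_UNKNOWN, FORMAT_ZLIB, FORMAT_ZSTD };

/*
 * Guess the compression format of a stream from its first bytes.
 * Assumes buf points at the start of the compressed data itself
 * (in a packfile, that would be after the entry's type/size header).
 */
static enum stream_format detect_stream_format(const unsigned char *buf,
					       size_t len)
{
	/* zstd frame magic 0xFD2FB528, stored in little-endian order */
	if (len >= 4 &&
	    buf[0] == 0x28 && buf[1] == 0xB5 &&
	    buf[2] == 0x2F && buf[3] == 0xFD)
		return FORMAT_ZSTD;

	/*
	 * zlib header (RFC 1950): compression method 8 (deflate) in the
	 * low nibble of CMF, window size exponent of at most 7 in the
	 * high nibble, and CMF*256 + FLG divisible by 31.
	 */
	if (len >= 2 &&
	    (buf[0] & 0x0F) == 8 &&
	    (buf[0] >> 4) <= 7 &&
	    ((buf[0] << 8) | buf[1]) % 31 == 0)
		return FORMAT_ZLIB;

	return FORMAT_UNKNOWN;
}
```

A caller would check the result and dispatch to the matching inflate routine, treating `FORMAT_UNKNOWN` as a corrupt object. The zstd check comes first because it is the more specific test.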
It seems that switching the compression algorithm to Zstandard would give a 10-20% speed improvement on reading, at no additional CPU cost when compressing. But driving such a migration while staying backward compatible with existing clients that use zlib would be hard.
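The backward-compatibility decision from step 1 of the quoted plan can be sketched roughly as follows. This is only an illustration under stated assumptions: `has_capability`, `choose_send_mode` and the `send_mode` enum are hypothetical names, the capability list is assumed to be a space-separated string, and "zstd" is the capability name suggested in the quote rather than anything Git advertises today.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; none of this exists in Git. */
enum send_mode { SEND_AS_STORED, SEND_RECOMPRESSED };

/*
 * Return 1 if the space-separated list `caps` contains `name` as a
 * whole token (optionally followed by "=value"), 0 otherwise.
 */
static int has_capability(const char *caps, const char *name)
{
	size_t n = strlen(name);
	const char *p = caps;

	if (n == 0)
		return 0;
	while ((p = strstr(p, name)) != NULL) {
		int starts = (p == caps || p[-1] == ' ');
		int ends = (p[n] == '\0' || p[n] == ' ' || p[n] == '=');
		if (starts && ends)
			return 1;
		p += n; /* skip this partial match and keep looking */
	}
	return 0;
}

/*
 * Decide how to serve an object. zlib data can always be sent as-is,
 * since every client understands it. zstd data can be sent as-is only
 * to a client that advertised "zstd"; otherwise it must be converted
 * back to zlib on the fly.
 */
static enum send_mode choose_send_mode(int stored_is_zstd,
				       const char *client_caps)
{
	if (stored_is_zstd && !has_capability(client_caps, "zstd"))
		return SEND_RECOMPRESSED;
	return SEND_AS_STORED;
}
```

The expensive path is `SEND_RECOMPRESSED`, which is why the quote suggests that only servers with a limited set of updated clients would want to switch early.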
Perhaps this is something GitLab folks might want to tackle?