# Improving artifact/cache archive performance
## What does this MR do?
Rewrites a chunk of the logic that performs archiving and extraction of artifacts and caches.
## Why was this MR needed?
Current issues:

- `filepath.Walk` is slow.
- Multiple `lstat` calls are performed:
  - `filepath.Walk` already provides `os.FileInfo`, but the information is thrown away when adding the file to a list of files to be archived.
  - `os.Lstat` is called again when the file is added to the list, this time it's temporarily kept in a `map[filename]os.FileInfo`.
  - The `os.FileInfo` data in the map is thrown away when the list of files is sorted, returning just the slice of filenames.
  - When adding the file to the archive, `os.Lstat` is called once again, so that the information can be passed to `zip.FileInfoHeader`.

  Whilst the OS will cache the stat information and return much faster on subsequent calls, there's no need to throw the information away and allocate it again. Using `lstat` to ensure a file exists before archiving/extraction is also racy; only opening the file will tell us whether it exists or not.
- Files are extracted sequentially, rather than using goroutines so that decompression can be offloaded to multiple cores.
- There's no buffer reuse/recycling for files being added to, and extracted from, the archive. `io.Copy` is used each time, which allocates a new buffer for each file (see the sketch after this list).
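To illustrate the last point, here's a minimal sketch (not the runner's actual code; the pool size and the `copyEntry` helper are illustrative) of recycling a copy buffer with `sync.Pool` and `io.CopyBuffer`, instead of letting `io.Copy` allocate a fresh one for every file:

```go
package archive

import (
	"io"
	"sync"
)

// bufPool hands out reusable 128 KiB copy buffers.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 128*1024) },
}

// copyEntry copies one archive entry using a pooled buffer. io.CopyBuffer
// still bypasses the buffer if src implements io.WriterTo or dst implements
// io.ReaderFrom, but it never allocates a buffer of its own.
func copyEntry(dst io.Writer, src io.Reader) (int64, error) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	return io.CopyBuffer(dst, src, buf)
}
```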
Solutions:

- Use a faster directory traversal method. Not happy with the current solutions out there, I've written one for this purpose: https://github.com/saracen/walker
- Call `os.Lstat` once and pass the result to `zip.FileInfoHeader`.
- Concurrent file compression requires harder changes, and is something we should address in the future. But concurrent decompression is easy to achieve (see the sketch after this list).
- Use, and then reuse, `bufio.Writer`s and `bufio.Reader`s for buffering.
- The zip archiving and extraction is a generic enough problem that I think it's better to solve it with its own package(s): I've written https://github.com/saracen/fastzip and https://github.com/saracen/zipextra to perform the tasks that were previously done here, but hopefully much better. They expose a simple API to achieve everything we need.
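Here's a minimal sketch of what concurrent decompression can look like using the standard library's `archive/zip` plus `golang.org/x/sync/errgroup` to bound the goroutines. It illustrates the approach, not fastzip's implementation, and omits permissions, timestamps and path sanitisation:

```go
// Package extract sketches concurrent decompression: each entry is
// decompressed in its own goroutine, bounded by the number of CPUs.
package extract

import (
	"archive/zip"
	"io"
	"os"
	"path/filepath"
	"runtime"

	"golang.org/x/sync/errgroup"
)

func Extract(archive, dest string) error {
	r, err := zip.OpenReader(archive)
	if err != nil {
		return err
	}
	defer r.Close()

	var g errgroup.Group
	g.SetLimit(runtime.NumCPU()) // cap the number of concurrent decompressors

	for _, f := range r.File {
		f := f // per-iteration copy for the goroutine
		if f.FileInfo().IsDir() {
			continue
		}
		g.Go(func() error {
			rc, err := f.Open() // safe concurrently: entries are read via io.ReaderAt
			if err != nil {
				return err
			}
			defer rc.Close()

			name := filepath.Join(dest, filepath.FromSlash(f.Name))
			if err := os.MkdirAll(filepath.Dir(name), 0o755); err != nil {
				return err
			}
			w, err := os.OpenFile(name, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, f.Mode())
			if err != nil {
				return err
			}
			defer w.Close()

			_, err = io.Copy(w, rc) // each goroutine decompresses independently
			return err
		})
	}
	return g.Wait()
}
```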
## Other issues solved

Whilst we're making these changes, I've also addressed a few other outstanding issues:

- There are a few operations that happen post extraction: `lchmod`, `chtimes` and `chown`:
  - I now use `lchmod` to suppress the annoying log messages users see in this preventable instance (#3354). Similarly, I'm using `lchtimes` for setting file modification times.
  - `lchmod` and `lchtimes` should never error. If you have permission to create a file, you have permission to set these.
  - `chown` is a pain, and I've addressed it the same way as previously done, with a warning message. But I've also fixed a bug where times were not modified on the extracted file if a chown error was encountered (see the first sketch after this list).
- Archiving prevented packaging files outside of the build directory, but there was no check in place to prevent extraction outside of the build directory.
- When finding a directory outside of the build directory to add to the archive, the current implementation warns about it and ignores it... but then still proceeds to traverse all of its subdirectories, warning for each one that it's also outside of the build directory. However, if the directory is outside of the build directory but the build directory is a subdirectory of it, the current behaviour, whilst traversing all children, is to then add all files/directories of the build directory once it stumbles across it.

  To make this more efficient, and less verbose, I instead warn about the parent directory being outside of the build directory and skip traversing all of its subdirectories. In the instance where the build directory is a subdirectory of that path, I detect this and instead warn that we'll jump to only including the build directory (see the second sketch after this list):

  ```
  WARNING: path / is outside of build directory, but the build directory is a subpath, walking /srv/stuff/gitlab-runner/build-directory
  ```
- The archive tests were using the current working directory and `chdir` to enter a temporary directory and leave once complete. Likewise, the extraction tests were using the working directory but didn't enter and leave a temporary one; they wrote test files to wherever `go test` was called from. I've now added an argument to specify the root directory for both operations.
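For the chown fix, here's a minimal sketch (the helper name and parameters are illustrative, not the MR's actual code; since the Go standard library has no lchtimes, the sketch simply skips symlinks when setting times) of downgrading a chown failure to a warning while still applying the modification times:

```go
package extract

import (
	"log"
	"os"
)

// applyOwnerAndTimes applies ownership and modification times after a file
// has been extracted. A chown failure no longer prevents the times from
// being set; it's only warned about.
func applyOwnerAndTimes(name string, fi os.FileInfo, uid, gid int) error {
	if err := os.Lchown(name, uid, gid); err != nil {
		// Previously an error here meant the times were never set; now we
		// warn and carry on.
		log.Printf("WARNING: chown %s: %v", name, err)
	}
	if fi.Mode()&os.ModeSymlink == 0 {
		// os.Chtimes follows symlinks, so only apply it to non-symlink entries.
		if err := os.Chtimes(name, fi.ModTime(), fi.ModTime()); err != nil {
			return err
		}
	}
	return nil
}
```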
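And here's a minimal sketch (hypothetical helpers, not the MR's actual code) of the containment checks: a directory outside of the build directory is warned about and skipped as a whole, unless the build directory is a subpath of it, in which case the walk jumps straight to the build directory:

```go
package walk

import (
	"path/filepath"
	"strings"
)

// within reports whether child is equal to, or contained in, parent.
func within(parent, child string) bool {
	rel, err := filepath.Rel(parent, child)
	if err != nil {
		return false
	}
	return rel == "." ||
		(rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator)))
}

// checkDir decides what to do with a directory encountered during the walk.
func checkDir(buildDir, dir string) (skip bool, walkInstead string) {
	switch {
	case within(buildDir, dir):
		// Inside the build directory: archive as normal.
		return false, ""
	case within(dir, buildDir):
		// Outside, but the build directory is a subpath: warn once and walk
		// the build directory directly.
		return true, buildDir
	default:
		// Entirely outside: warn once and skip the whole subtree rather than
		// warning for every child.
		return true, ""
	}
}
```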
I know there's a lot to take in here, but I'm having second thoughts about !1173 (closed), a merge request I created some time ago to "fix" this issue by setting the compression level. Whilst that does make things much faster, it's more of a band-aid. I think this MR should take priority, and eventually I'll port customising the compression level over to this solution. It'd be a shame for everybody to blindly set the compression level to be really fast, and then not know that a lot of these issues are better solved by this solution instead. This also provides improvements for everybody, not just those users that looked up the correct flags to set.
## Benchmarks
Writing decent benchmarks for this is quite a difficult task, due to the size and number of files involved and the differences in the code being tested.
Nevertheless, I've timed running both versions of `gitlab-runner` (the current version and the new version with my code) against two large datasets to produce a rough estimate of the gains.
This first table is the results of archiving and extracting with just the "store" method. No compression/decompression takes place. Gains here are from improvements in using a faster walk method, buffer recycling and concurrent extraction.
version | data | operation | method | time |
---|---|---|---|---|
current | linux repo, 70081 files, 3.8G | archiving | store | 12.420s |
new | linux repo, 70081 files, 3.8G | archiving | store | 9.672s |
current | linux repo, 70081 files, 3.8G | extracting | store | 12.973s |
new | linux repo, 70081 files, 3.8G | extracting | store | 9.086s |
current | UE4 Engine, 138085 files, 7.6G | archiving | store | 30.334s |
new | UE4 Engine, 138085 files, 7.6G | archiving | store | 22.939s |
current | UE4 Engine, 138085 files, 7.6G | extracting | store | 33.263s |
new | UE4 Engine, 138085 files, 7.6G | extracting | store | 18.960s |
In the next table, we see that concurrent extraction provides larger gains when the goroutines are busy decompressing:
version | data | operation | method | time |
---|---|---|---|---|
current | linux repo, 70081 files, 3.8G | archiving | standard deflate | 2m43.397s |
new | linux repo, 70081 files, 3.8G | archiving | standard deflate | 2m34.231s |
current | linux repo, 70081 files, 3.8G | extracting | standard deflate | 52.152s |
new | linux repo, 70081 files, 3.8G | extracting | standard deflate | 33.800s |
current | UE4 Engine, 138085 files, 7.6G | archiving | standard deflate | 4m32.904s |
new | UE4 Engine, 138085 files, 7.6G | archiving | standard deflate | 4m19.269s |
current | UE4 Engine, 138085 files, 7.6G | extracting | standard deflate | 1m30.032s |
new | UE4 Engine, 138085 files, 7.6G | extracting | standard deflate | 29.120s |
The next table swaps out the standard library's `compress/flate` with `github.com/klauspost/compress/flate`. klauspost recently rewrote his deflate library and it's now even faster.
version | data | operation | method | time |
---|---|---|---|---|
current | linux repo, 70081 files, 3.8G | archiving | klauspost's deflate | 44.184s |
new | linux repo, 70081 files, 3.8G | archiving | klauspost's deflate | 35.777s |
current | linux repo, 70081 files, 3.8G | extracting | klauspost's deflate | 31.144s |
new | linux repo, 70081 files, 3.8G | extracting | klauspost's deflate | 15.062s |
current | UE4 Engine, 138085 files, 7.6G | archiving | klauspost's deflate | 2m8.177s |
new | UE4 Engine, 138085 files, 7.6G | archiving | klauspost's deflate | 1m54.711s |
current | UE4 Engine, 138085 files, 7.6G | extracting | klauspost's deflate | 1m22.077s |
new | UE4 Engine, 138085 files, 7.6G | extracting | klauspost's deflate | 17.952s |
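As a reference for how such a swap can be wired up (this is an assumption about the wiring, not necessarily what was benchmarked here), `archive/zip` allows a custom compressor/decompressor to be registered for the Deflate method:

```go
package ziputil

import (
	"archive/zip"
	"io"

	"github.com/klauspost/compress/flate"
)

// NewWriter returns a zip.Writer that deflates with klauspost's flate.
func NewWriter(w io.Writer) *zip.Writer {
	zw := zip.NewWriter(w)
	zw.RegisterCompressor(zip.Deflate, func(out io.Writer) (io.WriteCloser, error) {
		return flate.NewWriter(out, flate.DefaultCompression)
	})
	return zw
}

// NewReader returns a zip.Reader that inflates with klauspost's flate.
func NewReader(r io.ReaderAt, size int64) (*zip.Reader, error) {
	zr, err := zip.NewReader(r, size)
	if err != nil {
		return nil, err
	}
	zr.RegisterDecompressor(zip.Deflate, func(in io.Reader) io.ReadCloser {
		return flate.NewReader(in)
	})
	return zr, nil
}
```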
## Summary
In summary, we could just use Klaus' deflate library and see improvements immediately, and perhaps we should do that whilst this MR is being discussed. But I still think this MR is worth it. With concurrent extraction and Klaus' library, we've approached the time it takes to extract the non-compressed version, and the bottleneck becomes just I/O, which is exactly what we want. I also think this direction allows for further improvements, and I have some ideas about how we can achieve concurrent compression.
## What are the relevant issue numbers?
#1797 (closed) https://gitlab.com/gitlab-org/gitlab-ce/issues/33179 #1765 !1173 (closed)