Git archive cache
Repositories in GitLab have a 'Download ZIP' button. I call this the 'git archive' feature, because these files are created with the git archive command. This command is CPU and IO intensive.
For a long time GitLab has had a primitive disk cache for these archive files. For each 'Download ZIP' click we first look up which revision of the repository should go in the zip file. This revision is then used to look up a pre-existing file by name. If this file is not found we create the archive on the fly, and store it under the expected file name so that it does not have to be generated again for the next download.
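The lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not GitLab's actual implementation: the function names (`resolve_commit`, `git_archive_zip`, `fetch_archive`), the `<project>-<sha>.zip` naming scheme and the write-to-temp-file-then-rename step are all assumptions made for the example.

```python
import os
import subprocess
import tempfile
from typing import Callable, Tuple


def resolve_commit(repo_path: str, ref: str) -> str:
    """Resolve a branch, tag or SHA to a full commit SHA."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "--verify", f"{ref}^{{commit}}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def git_archive_zip(repo_path: str, commit_sha: str, out_path: str) -> None:
    """Run the (CPU and IO intensive) git archive command."""
    subprocess.run(
        ["git", "-C", repo_path, "archive", "--format=zip",
         f"--output={out_path}", commit_sha],
        check=True,
    )


def fetch_archive(cache_dir: str, project: str, commit_sha: str,
                  generate: Callable[[str], None]) -> Tuple[str, bool]:
    """Return (archive_path, cache_hit) for the given revision.

    `generate` is called with an output path only on a cache miss.
    """
    # Hypothetical naming scheme: the revision determines the file name.
    path = os.path.join(cache_dir, f"{project}-{commit_sha}.zip")
    if os.path.exists(path):
        return path, True  # hit: reuse the pre-existing file

    # Miss: generate into a temporary file first, then rename, so that a
    # concurrent reader never sees a half-written archive (an assumption of
    # this sketch, not something the text above specifies).
    os.makedirs(cache_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    os.close(fd)
    try:
        generate(tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return path, False
```

A caller would first resolve the requested ref with `resolve_commit` and then pass something like `lambda out: git_archive_zip(repo_path, sha, out)` as `generate`. Keying the file name on the resolved commit rather than on the branch name means a branch that moves simply produces a new cache entry.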
These cached archive files are stored in one big directory that we periodically clean with a Sidekiq job. This 'one big directory' is on an NFS share and it's shared between the 'web' and 'api' machines on gitlab.com. In a world without NFS we cannot rely on this shared disk cache anymore.
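For illustration, a periodic cleanup of such a directory could look something like the sketch below. The text above only says that the directory is cleaned periodically by a Sidekiq job; the age-based policy, the function name `clean_archive_cache` and the `max_age_seconds` parameter are assumptions for this example, and it is written in Python rather than as an actual Sidekiq worker.

```python
import os
import time
from typing import List, Optional


def clean_archive_cache(cache_dir: str, max_age_seconds: float,
                        now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_seconds; return their names."""
    now = time.time() if now is None else now

    # Snapshot the directory listing before deleting anything.
    with os.scandir(cache_dir) as it:
        entries = list(it)

    removed = []
    for entry in entries:
        try:
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                os.unlink(entry.path)
                removed.append(entry.name)
        except FileNotFoundError:
            # Another machine sharing the directory may have removed the
            # file already; that is fine.
            continue
    return sorted(removed)
```

Tolerating `FileNotFoundError` matters here because several machines share the same directory, so two cleaners (or a cleaner and a writer) can race on the same file.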
We have metrics for this cache that count the (absolute) number of hits and misses: https://prometheus.gitlab.com/graph?g0.range_input=30d&g0.expr=sum(rate(gitlab_workhorse_git_archive_cache%7Benvironment%3D%22prd%22%7D%5B5m%5D))%20by%20(result)&g0.tab=0