Fix perf build-id cache expiry
Problem
The Gitaly hosts are gradually filling up their /var/log filesystem, apparently because the perf build-id cache now grows without bound.
The cache expiry mechanism is incompatible with Gitaly's pattern of copying its binaries into ephemeral run dirs that change after every deploy.
We need to solve this before it fills the filesystem and impacts Gitaly. I suspect this will be a fairly quick fix, but if I am wrong, we can reset the deadline by deleting the cache on at-risk nodes.
Background
Periodic host profiling on the Gitaly nodes uses and maintains a shared perf build-id cache.
The cache expiry policy is based on keeping the last N versions of a binary at any given directory path. This works well for most libraries and packages.
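Unfortunately, Gitaly has started using a unique path for running its binaries -- a path that includes the PID of the gitaly process. Each new Gitaly process therefore creates a fresh set of cache paths, each holding a single version, so the keep-last-N policy never expires any of them and the cache grows without bound.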
Example:
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo find /var/log/perf_build_id_cache/ -type d | sort -V | less
...
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115/gitaly-git2go
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115/gitaly-git2go/690fc83b2f6c8e0c21d09e214fe91862f3da5222
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115/gitaly-hooks
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115/gitaly-hooks/dd0cf0720af3b2fdf827a76d2f85b8898052a4ce
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115/gitaly-lfs-smudge
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2321115/gitaly-lfs-smudge/d1f7cca5e6ed22f263c04ed05fd337aad8365a0a
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407/gitaly-git2go
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407/gitaly-git2go/035019fa5b74211edad6214b9b5de9e49e3df603
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407/gitaly-hooks
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407/gitaly-hooks/df7c298238c3c2c7245bcffa9551ce41abfbb85c
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407/gitaly-lfs-smudge
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327407/gitaly-lfs-smudge/8a923a3f56c2f4be48ba1c5fe35f317c881a3c46
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-git2go
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-git2go/3391e22d1bc798af516fe154fc206a31d703bc3c
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-hooks
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-hooks/78ac670338f8556b7de82165727cef030a86f6f7
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-lfs-smudge
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-lfs-smudge/6b94ba496d08ffaacdb2d4efe205a72a46b3f3f7
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-ssh
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2327613/gitaly-ssh/a4411f892466db7554364a5eb0a6e594f5f015aa
/var/log/perf_build_id_cache/var/opt/gitlab/gitaly/run/gitaly-2379399
...
I think that decision was probably aiming to support a clean transition for zero downtime upgrades, such that a running gitaly process has no risk of a package upgrade replacing its support programs (e.g. git, gitaly-hooks, gitaly-lfs-smudge, etc.).
We cannot expire files from the cache based purely on age, because that would break retention of rarely upgraded binaries and libraries.
This issue is to find a viable solution to this problem.
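To make the mismatch concrete, here is a minimal Python sketch of a "keep the newest N versions per path" policy as described above. It is a model for illustration only, not the real expiry script (which is a shell script); the function name `expired_per_path`, the mtimes, and the `/usr/bin/example` entries are hypothetical. The gitaly paths and build-ids are taken from the listing above.

```python
from collections import defaultdict


def expired_per_path(entries, keep):
    """Return (path, build_id) pairs beyond the newest `keep` versions per path.

    `entries` is an iterable of (binary_path, build_id, mtime) tuples.
    This models the policy as described in this issue; it is not the real script.
    """
    by_path = defaultdict(list)
    for path, build_id, mtime in entries:
        by_path[path].append((mtime, build_id))

    expired = []
    for path, versions in by_path.items():
        versions.sort(reverse=True)  # newest first
        expired.extend((path, build_id) for _, build_id in versions[keep:])
    return expired


RUN = "/var/opt/gitlab/gitaly/run"

# Paths and build-ids from the listing above; the mtimes are made up.
gitaly_entries = [
    (f"{RUN}/gitaly-2321115/gitaly-hooks", "dd0cf0720af3b2fdf827a76d2f85b8898052a4ce", 1),
    (f"{RUN}/gitaly-2327407/gitaly-hooks", "df7c298238c3c2c7245bcffa9551ce41abfbb85c", 2),
    (f"{RUN}/gitaly-2327613/gitaly-hooks", "78ac670338f8556b7de82165727cef030a86f6f7", 3),
]

# Hypothetical binary at a stable path, upgraded twice.
stable_entries = [
    ("/usr/bin/example", "build-id-a", 1),
    ("/usr/bin/example", "build-id-b", 2),
    ("/usr/bin/example", "build-id-c", 3),
]

print(expired_per_path(gitaly_entries, keep=1))  # nothing expires
print(expired_per_path(stable_entries, keep=1))  # the two older versions expire
```

Because every gitaly path embeds a different PID, each path holds exactly one version and nothing ever falls outside the newest N, whereas the stable path sheds its older versions as intended.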
Approach
My first thought here is to add a special case handler for gitaly's path: /var/opt/gitlab/gitaly/run/gitaly-17683. We could ignore the PID portion of that specific path, and retain only the most recent N directories under that path pattern. The rest of the cached binaries can continue to use the current logic; it looks like only gitaly is using this pattern of copying all of its binaries into an ephemeral run dir.
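As a rough sketch of that special case, the selection could look like the Python below. The actual expiry script is shell, so this only illustrates the logic; the function name `gitaly_run_dirs_to_expire`, the regex, and the sample mtimes are assumptions made for the example. It ranks directories by modification time rather than by PID, on the assumption that PIDs can wrap around and so do not reliably indicate recency.

```python
import re

# Matches the per-process run dir names seen in the cache, e.g. "gitaly-2321115".
GITALY_RUN_DIR = re.compile(r"^gitaly-\d+$")


def gitaly_run_dirs_to_expire(run_dirs, keep):
    """Return names of gitaly-<PID> dirs to delete, keeping the newest `keep`.

    `run_dirs` is an iterable of (dir_name, mtime) pairs for the entries
    directly under .../var/opt/gitlab/gitaly/run inside the cache.
    Entries that do not match the gitaly-<PID> pattern are left alone.
    """
    candidates = [
        (mtime, name) for name, mtime in run_dirs if GITALY_RUN_DIR.match(name)
    ]
    candidates.sort(reverse=True)  # newest first
    return [name for _, name in candidates[keep:]]


# Directory names from the listing in this issue; mtimes are made up.
sample = [
    ("gitaly-2321115", 100.0),
    ("gitaly-2327407", 200.0),
    ("gitaly-2327613", 300.0),
    ("gitaly-2379399", 400.0),
]

print(gitaly_run_dirs_to_expire(sample, keep=2))
```

In a real run the (name, mtime) pairs would come from the cache directory itself, and everything outside the gitaly run dir would still go through the existing per-path logic.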
For quick reference:
- Chef recipe: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/blob/master/recipes/periodic-host-profile.rb
- Expiry script: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/blob/master/files/default/expire_old_build_id_cache_entries.sh
- Config file template: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/blob/master/templates/default/capture_and_upload_host_profile.conf.erb