Local runner cache stale after first upload to S3
Summary
We've noticed that for all new pipelines, the first set of jobs that pull a freshly uploaded cache will re-download from S3 the same zip that was created locally just moments before.
As a result, the same cache is downloaded unnecessarily by several jobs at once, which degrades overall pipeline, disk, and network performance.
Steps to reproduce
- A single runner with shell executor
  - in this example, we set the concurrency to 5
- S3 Distributed Cache configured
- Runner caches cleared
- A pipeline with the following dependency graph
stateDiagram-v2
direction LR
install_deps --> test_unit
note left of install_deps
cache:policy: pull-push
end note
note right of test_unit
cache:policy: pull
end note
install_deps --> test_auto
note right of test_auto
cache:policy: pull
end note
install_deps --> code_lint
note right of code_lint
cache:policy: pull
end note
install_deps --> code_scan
note right of code_scan
cache:policy: pull
end note
install_deps --> pkg_build
note right of pkg_build
cache:policy: pull
end note
Actual behavior
All 5 subsequent jobs will proceed to download this cache, presenting the following in logs simultaneously:
Checking cache for cd16bacbce04f8c6317deb28ada80432f73366a8...
Runtime platform arch=arm64 os=linux pid=121994 revision=0d4137b8 version=15.5.0
Downloading cache.zip from https://xxxxxxx.s3.dualstack.us-east-1.amazonaws.com/gitlab-runner/cache/project/12345/cd16bacbce04f8c6317deb28ada80432f73366a8
Downloading cache 353.15 MB/353.15 MB (16.0 MB/s)
Successfully extracted cache
Expected behavior
There is no need to download as the local runner cache is current. The preferred behaviour would be for the 5 jobs to instead present:
Checking cache for cd16bacbce04f8c6317deb28ada80432f73366a8...
Runtime platform arch=arm64 os=linux pid=434433 revision=0d4137b8 version=15.5.0
cache.zip is up to date
Successfully extracted cache
Used GitLab Runner version
Version: 15.5.0
Git revision: 0d4137b8
Git branch: 15-5-stable
GO version: go1.18.7
Built: 2022-10-22T23:52:16+0000
OS/Arch: linux/arm64
Possible fixes
I believe it all comes down to the result of checkIfUpToDate and the fact that the compared modification times differ ever so slightly:
- The /home/gitlab-runner/cache/xxxxx/cache.zip created by the install_deps job has a LastModified time of 15:38:31.725133024
  - Checked with the stat command
  - Checked with os.Lstat(path).ModTime()
- The cache object uploaded to S3 has a LastModified time of 2022-11-05T15:38:32+00:00
  - Checked with the aws s3api head-object command
- Maybe the Last-Modified header in the upload request to S3 is incorrect or not being applied correctly?
- Maybe the local file Modified Time can be adjusted to match the remote Last-Modified after upload completes?
I am not sure if this only affects S3 or all distributed cache configurations. I have not checked this against other executors either.
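To make the suspected mismatch concrete, here is a minimal sketch that compares the two timestamps from this report. It is only an illustration of an "is the remote newer than the local file" check, not the runner's actual code. The helper names to_ns and is_up_to_date are made up for this example, the local timestamp is assumed to fall on the same date as the S3 one, and GNU date, stat and touch are assumed. The last part tries the second idea above on a scratch file: once the local mtime is set to the remote value, the same check passes.

```shell
#!/usr/bin/env bash
# Sketch only: mimics a "local file is at least as new as the remote" check.
# to_ns and is_up_to_date are hypothetical helpers, not runner functions.

# Convert a timestamp string to nanoseconds since the epoch (GNU date).
to_ns() {
  date -u -d "$1" +%s%N
}

# is_up_to_date LOCAL_TS REMOTE_TS
# Succeeds when the remote timestamp is not later than the local one.
is_up_to_date() {
  local local_ns remote_ns
  local_ns=$(to_ns "$1")
  remote_ns=$(to_ns "$2")
  (( remote_ns <= local_ns ))
}

# Values from this report; the date of the local timestamp is an assumption.
local_ts='2022-11-05 15:38:31.725133024 UTC'
remote_ts='2022-11-05 15:38:32 UTC'

if is_up_to_date "$local_ts" "$remote_ts"; then
  echo "cache.zip is up to date"
else
  echo "remote looks newer: cache.zip would be downloaded again"
fi

# Possible fix, tried on a scratch file: copy the remote time onto the local file.
scratch=$(mktemp)
touch -d "$local_ts" "$scratch"       # simulate the freshly written zip
touch -c -d "$remote_ts" "$scratch"   # align its mtime with the remote value
synced_ts="@$(stat -c %Y "$scratch")" # mtime as an epoch value for date -d
if is_up_to_date "$synced_ts" "$remote_ts"; then
  echo "after sync: cache.zip is up to date"
fi
rm -f "$scratch"
```

With the values above, the remote timestamp is roughly 275 ms later than the local one, which is enough for a strict comparison to treat the local zip as stale.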
Workaround
Running in the background on the runners, this script uses inotify-tools to move the modification time of each newly created cache zip one minute forward, so the local copy is no longer older than the object in S3:
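#!/usr/bin/env bash
# Watch the runner cache directory for files being moved into place.
inotifywait -m -q -r -e moved_to /home/gitlab-runner/cache |
  while read -r dir ev file; do
    if [[ "$file" == "cache.zip" ]]; then
      path="${dir}${file}"
      echo "Updating modification time of $path"
      # Set the mtime to one minute after its current value (GNU touch).
      touch -c -r "$path" -d '1 min' "$path"
    fi
  done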
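For anyone unsure what the touch line in the workaround does, this small check on a scratch file may help. It assumes GNU touch and stat; with GNU touch, combining -r with a relative -d uses the reference file's time as the origin. The starting timestamp and the variable names are arbitrary choices for this illustration.

```shell
#!/usr/bin/env bash
# Show the effect of `touch -r FILE -d '1 min' FILE` on a scratch file.
scratch=$(mktemp)
touch -d '2022-11-05 15:38:31 UTC' "$scratch"   # arbitrary starting mtime
before=$(stat -c %Y "$scratch")

# Same invocation as in the workaround: the reference file is the file itself,
# so the relative date is applied to its own current mtime.
touch -c -r "$scratch" -d '1 min' "$scratch"
after=$(stat -c %Y "$scratch")

shift_seconds=$(( after - before ))
echo "mtime moved forward by ${shift_seconds}s"
rm -f "$scratch"
```

One minute is far more than the sub-second gap seen in this report, so the local zip ends up newer than the uploaded object.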