Local runner cache stale after first upload to S3

Summary

We've noticed that in every new pipeline, the first set of jobs to pull a freshly uploaded cache re-downloads it from S3, even though an identical cache.zip was created locally on the same runner just moments before.

This can result in the same cache being downloaded several times concurrently for no benefit, degrading overall pipeline, disk, and network performance.

Steps to reproduce

  • A single runner with shell executor
    • in this example, we set the concurrency to 5
  • S3 Distributed Cache configured
    • Runner caches cleared
  • A pipeline with the following dependency graph
stateDiagram-v2
    direction LR
    install_deps --> test_unit
    note left of install_deps
        cache:policy: pull-push
    end note
    note right of test_unit
        cache:policy: pull
    end note
    install_deps --> test_auto
    note right of test_auto
        cache:policy: pull
    end note
    install_deps --> code_lint
    note right of code_lint
        cache:policy: pull
    end note
    install_deps --> code_scan
    note right of code_scan
        cache:policy: pull
    end note
    install_deps --> pkg_build
    note right of pkg_build
        cache:policy: pull
    end note

Actual behavior

All 5 downstream jobs download the cache at the same time, each presenting the following in its log:

Checking cache for cd16bacbce04f8c6317deb28ada80432f73366a8...
Runtime platform                                    arch=arm64 os=linux pid=121994 revision=0d4137b8 version=15.5.0
Downloading cache.zip from https://xxxxxxx.s3.dualstack.us-east-1.amazonaws.com/gitlab-runner/cache/project/12345/cd16bacbce04f8c6317deb28ada80432f73366a8 
Downloading cache 353.15 MB/353.15 MB (16.0 MB/s)                
Successfully extracted cache

Expected behavior

There is no need to download anything, as the local runner cache is current. The preferred behavior would be for the 5 jobs to present this instead:

Checking cache for cd16bacbce04f8c6317deb28ada80432f73366a8...
Runtime platform                                    arch=arm64 os=linux pid=434433 revision=0d4137b8 version=15.5.0
cache.zip is up to date                            
Successfully extracted cache

Used GitLab Runner version

Version:      15.5.0
Git revision: 0d4137b8
Git branch:   15-5-stable
GO version:   go1.18.7
Built:        2022-10-22T23:52:16+0000
OS/Arch:      linux/arm64

Possible fixes

I believe it all comes down to the result of checkIfUpToDate here and the fact that the compared modification times differ ever so slightly:

  • The /home/gitlab-runner/cache/xxxxx/cache.zip created by the install_deps job has a LastModified time of 15:38:31.725133024
    • Checked with the stat command
    • Checked with os.Lstat(path).ModTime()
  • The cache object uploaded to S3 has a LastModified time of 2022-11-05T15:38:32+00:00, a fraction of a second later than the local file
    • Checked with the aws s3api head-object command

Two ideas:

  1. Maybe the Last-Modified header in the upload request to S3 is incorrect or not being applied correctly?
  2. Maybe the local file's modification time can be adjusted to match the remote Last-Modified after the upload completes?
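
To make the suspected mismatch concrete, here is a minimal sketch of such a comparison using the two timestamps reported above. It is not the runner's code: it assumes the check treats the local file as current only when the remote Last-Modified is not later than the local modification time, and it assumes the local timestamp is from the same day in UTC. The helpers to_ns and is_up_to_date are hypothetical names, and GNU date is required for %N.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics an "is the local cache current?" check.
# Assumption: the local file counts as current when the remote
# Last-Modified is NOT later than the local modification time.

# Convert a timestamp string to integer nanoseconds since the epoch (GNU date).
to_ns() { date -u -d "$1" +%s%N; }

# Succeeds (exit status 0) when the local copy is considered up to date.
is_up_to_date() {
  local local_ns remote_ns
  local_ns=$(to_ns "$1")
  remote_ns=$(to_ns "$2")
  (( remote_ns <= local_ns ))
}

local_mtime='2022-11-05 15:38:31.725133024 UTC'   # date and zone are assumed
remote_mtime='2022-11-05T15:38:32+00:00'          # as reported by head-object

if is_up_to_date "$local_mtime" "$remote_mtime"; then
  echo "cache.zip is up to date"
else
  echo "Downloading cache.zip"
fi
```

With these values the remote timestamp is later than the local one, so the sketch takes the download branch. Under this assumption, any upload that completes after the local file was last written would make the local copy look stale.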

I am not sure if this only affects S3 or all distributed cache configurations. I have not checked this against other executors either.
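
The second idea (matching the local modification time to the remote Last-Modified once the upload completes) could look roughly like the sketch below. It runs against a scratch file rather than a real cache; align_mtime is a hypothetical helper, not something GitLab Runner provides, and GNU touch, stat, and date are assumed.

```shell
#!/usr/bin/env bash
# Sketch: set a local file's mtime to the remote Last-Modified value so
# that a later comparison sees the two timestamps as equal.
# align_mtime is a hypothetical helper, not part of GitLab Runner.
align_mtime() {
  local path=$1 remote_last_modified=$2
  # -c: never create the file; -m: change only the modification time.
  touch -c -m -d "$remote_last_modified" "$path"
}

# Demonstrate on a scratch file carrying the local timestamp from the report
# (date and zone are assumed).
demo=$(mktemp)
touch -d '2022-11-05 15:38:31.725133024 UTC' "$demo"

align_mtime "$demo" '2022-11-05T15:38:32+00:00'

# Read the mtime back in the same format that S3 reported.
aligned=$(date -u -d "@$(stat -c %Y "$demo")" +%Y-%m-%dT%H:%M:%S+00:00)
echo "$aligned"
rm -f "$demo"
```

On a real runner, the remote value would have to come from the uploaded object itself, for example from the same head-object call used for the check above.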

Workaround

Running in the background on the runners, this script uses inotify-tools to move the modification time of each newly created cache.zip one minute forward, so that it is later than the Last-Modified time S3 records for the upload:

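#!/usr/bin/env bash

# Watch the runner cache directory recursively for files moved into place.
inotifywait -m -q -r -e moved_to /home/gitlab-runner/cache |
while read -r dir ev file; do
  # Only act on cache archives.
  if [[ "$file" == "cache.zip" ]]; then
    path="${dir}${file}"
    echo "Updating modification time of $path"
    # Set the timestamp to the file's current one plus one minute (GNU touch).
    touch -c -r "$path" -d '1 min' "$path"
  fi
done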
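
To check what the touch invocation in the workaround does without waiting for a pipeline, the following sketch applies it to a scratch file and measures the shift. It assumes GNU touch, where a relative -d value combined with -r is applied to the reference file's timestamp; the scratch file and variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Apply the workaround's touch call to a scratch file and measure the effect.
scratch=$(mktemp)
touch -d '2022-11-05 15:38:31.725133024 UTC' "$scratch"

before=$(stat -c %Y "$scratch")          # mtime in whole seconds since the epoch
touch -c -r "$scratch" -d '1 min' "$scratch"
after=$(stat -c %Y "$scratch")

shift_s=$(( after - before ))
echo "shift: ${shift_s} seconds"
rm -f "$scratch"
```

A shift of 60 seconds is far larger than the sub-second gap observed between the local file and the S3 object, which is why the adjusted file compares as newer.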