S3 cache returns empty or outdated files after upgrading to GitLab Runner 18.3

Summary

A GitLab Premium customer reported cache issues on an AWS runner with S3 distributed cache configured, in ticket ZD656484.

After upgrading the gitlab-runner binary to version 18.3, the cache either produces an empty zip file (~22 bytes) or serves stale files to subsequent jobs instead of properly caching content when using the AWS S3 distributed cache. This causes subsequent jobs that depend on cached data to fail, requiring multiple pipeline retries until the cache is properly created. The issue affects AWS deployments using the S3 distributed cache and is causing all AWS deployments to fail on the first run.

The issue is confirmed as specific to the AWS S3 distributed cache; the local cache and the Azure distributed cache work correctly.
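
To confirm the empty archive on the S3 side, the cache object can be inspected directly. Below is a minimal sketch using the AWS CLI (not from the ticket); the bucket name comes from the config.toml further down, and the object key placeholder corresponds to the path shown in the "Uploading cache.zip to ..." log lines. An empty zip is exactly 22 bytes (just the end-of-central-directory record), so a ContentLength around that value means the archive holds no entries.

# Sketch only: verify the size of the uploaded cache object.
# <PATH> is the key prefix visible in the job's upload URL.
aws s3api head-object \
  --bucket brpcache-glr-ntmto \
  --key "<PATH>/1973-14-protected" \
  --query '{Size: ContentLength, ETag: ETag, Modified: LastModified}'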

Steps to reproduce

  • Run a pipeline with jobs that create and consume cache (e.g., terraform init followed by terraform validate)
  • Observe that the cache appears to be created successfully but contains no actual data
  • Subsequent jobs fail because cached directories (the .terraform folder) are missing
  • Multiple retries are required until the cache is properly created
  • The issue persists even after removing the local cache volumes configuration
  • The issue persists even after implementing cache policies and pipeline optimizations
  • The issue does NOT occur with the local cache on the same AWS runners, or with the Azure distributed cache on other providers
.gitlab-ci.yml
.terraform_base:
  tags: 
    - ${PROVIDER}
  image: 
    ...
  cache:
    key: "${ID}"
    when: always
    paths:
      - terraform-configuration/${PROVIDER}/${SERVICE}/.terraform/
      - terraform-configuration/${PROVIDER}/${SERVICE}/.terraform.lock.hcl
      - terraform-configuration/${PROVIDER}/${SERVICE}/*_generated
      - terraform-configuration/${PROVIDER}/${SERVICE}/*.auto.tfvars
      - terraform-configuration/${PROVIDER}/${SERVICE}/cache/*

pipeline_pre_provisioning:
  stage: pipeline_pre_provisioning
  extends:
    - .terraform_base
  script:
    - *job_pre_provisioning
    - Job Scripts
    - *job_post_provisioning

# Example job structure that demonstrates the issue
terraform-init:
  stage: init
  needs: 
  - pipeline_pre_provisioning
  extends:
  - .terraform_base
  script:
  - terraform init

terraform-validate:
  stage: validate
  script:
  - terraform validate
  cache:
    policy: pull
  needs: ["terraform-init"]

Actual behavior

  • First job:
Created fresh repository.
Checking out 9ccbcdc4 as detached HEAD (ref is main)...
...
Temporary file: ../../../../../../cache/<PATH>/archive_3332552881 
Uploading cache.zip to https://<S3BUCKET>/1973-14-protected 
Uploading cache 7.19 KB/7.19 KB (359.7 KB/s)   
  • Second job (adding more files to the cache path): new files are added to the cache paths, but the upload is skipped with Archive is up to date!
Checking cache for 1973-14-protected...
Downloading cache from https://<S3BUCKET>/1973-14-protected  ETag="588a0a025ad24a41e9613ad9197b48b9"
Downloading cache 7.19 KB/7.19 KB (35.1 MB/s)               
...
Creating cache 1973-14-protected...
.... found 3 matching artifact files and directories 
Archive is up to date!                             
Created cache
  • Third job:
Checking cache for 1973-14-protected...
Downloading cache from <S3BUCKET>/1973-14-protected  ETag="588a0a025ad24a41e9613ad9197b48b9"
Downloading cache 7.19 KB/7.19 KB (35.1 MB/s) 

After a couple of retries, the third job finally downloads the full cache:

Checking cache for 1973-14-protected...
Downloading cache from https://<S3BUCKET>/1973-14-protected  ETag="1d1ef1d7b78d8568a184276f423061fd"
Downloading cache 254.10 MB/254.10 MB (58.0 MB/s)  
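
The unchanged ETag between the second and third job shows that the object in S3 was never replaced; only after the retries does a new ETag (and the full 254 MB archive) appear. A sketch for watching this from outside the pipeline, using the same placeholder key as above:

# Sketch: list the cache object between pipeline runs; size and timestamp
# only change once a retry finally uploads the full archive.
aws s3 ls "s3://brpcache-glr-ntmto/<PATH>/" --recursive --human-readable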

Expected behavior

The second job should upload a cache archive containing the newly matched files instead of skipping the upload.

Relevant logs and/or screenshots

Can be seen in the ticket.

Environment description

Docker executor runner on AWS with the S3 distributed cache enabled.

config.toml contents
concurrent = 10
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
 session_timeout = 1800

[[runners]]
 name = "cicdlp00011"
 url = "https://gitlab.com"
 id = 35068964
 token = "<placeholder>"
 token_obtained_at = 2024-04-22T19:39:02Z
 token_expires_at = 0001-01-01T00:00:00Z
 executor = "docker"
 [runners.aws]
   AssumeRoleARN = ""
 [runners.cache]
   MaxUploadedArchiveSize = 0
   Type = "s3"
   Shared = true
   [runners.cache.s3]
     ServerAddress = "s3.amazonaws.com"  # or your S3 endpoint
     AccessKey = "<placeholder>"
     SecretKey = "<placeholder>"
     BucketName = "brpcache-glr-ntmto"
     BucketLocation = "us-east-1"
     Insecure = false
 [runners.docker]
   tls_verify = false
   image = "alpine:latest"
   privileged = false
   disable_entrypoint_overwrite = false
   oom_kill_disable = false
   disable_cache = false
   volumes = ["/home/gitlab-runner/cache:/cache"]
   shm_size = 0
   network_mtu = 0

Removing the local cache was tried, but the issue still persists. Deleting the distributed cache works as a temporary workaround.
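
For reference, a sketch of the workaround that currently unblocks pipelines: deleting the stale distributed cache object so that the next run regenerates it. The key path is a placeholder taken from the job logs.

# Workaround sketch: remove the stale cache object; the next pipeline
# recreates it from scratch.
aws s3 rm "s3://brpcache-glr-ntmto/<PATH>/1973-14-protected"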

Used GitLab Runner version

Running with gitlab-runner 18.3.1 (5a021a1c)
  on cicdlp00013 Hb54rpvJY, system ID: s_1ffd0d9f7567
  feature flags: FF_USE_FASTZIP:true

Possible fixes