S3 cache returns empty or outdated files after upgrading to GitLab Runner 18.3
Summary
A GitLab Premium customer experiencing cache issues on an AWS runner with S3 configured reported this in ticket ZD656484.
After upgrading the gitlab-runner binary to version 18.3, the cache either produces an empty zip file (~22 bytes) or serves stale files to subsequent jobs instead of properly caching content when using the AWS S3 distributed cache. This causes subsequent jobs that depend on cached data to fail, requiring multiple pipeline retries until the cache is properly created. The issue affects AWS deployments using the S3 distributed cache and is causing all of the customer's AWS deployments to fail on first run.
The issue is confirmed as specific to the AWS S3 distributed cache: the local cache and the Azure distributed cache work correctly.
Steps to reproduce
- Run a pipeline with jobs that create and consume cache (e.g., terraform init followed by terraform validate)
- Observe that cache appears to be created successfully but contains no actual data
- Subsequent jobs fail because cached directories (.terraform folder) are missing
- Multiple retries are required until cache is properly created
- Issue persists even after removing local cache volumes configuration
- Issue persists even after implementing cache policies and pipeline optimizations
- Issue does NOT occur with local cache on the same AWS runners or Azure distributed cache on other providers
.gitlab-ci.yml
.terraform_base:
  tags:
    - ${PROVIDER}
  image:
    ...
  cache:
    key: "${ID}"
    when: always
    paths:
      - terraform-configuration/${PROVIDER}/${SERVICE}/.terraform/
      - terraform-configuration/${PROVIDER}/${SERVICE}/.terraform.lock.hcl
      - terraform-configuration/${PROVIDER}/${SERVICE}/*_generated
      - terraform-configuration/${PROVIDER}/${SERVICE}/*.auto.tfvars
      - terraform-configuration/${PROVIDER}/${SERVICE}/cache/*

pipeline_pre_provisioning:
  stage: pipeline_pre_provisioning
  extends:
    - .terraform_base
  script:
    - *job_pre_provisioning
    - Job Scripts
    - *job_post_provisioning

# Example job structure that demonstrates the issue
terraform-init:
  stage: init
  needs:
    - pipeline_pre_provisioning
  extends:
    - .terraform_base
  script:
    - terraform init

terraform-validate:
  stage: validate
  script:
    - terraform validate
  cache:
    policy: pull
  needs: ["terraform-init"]
Actual behavior
- First job:
Created fresh repository.
Checking out 9ccbcdc4 as detached HEAD (ref is main)...
...
Temporary file: ../../../../../../cache/<PATH>/archive_3332552881
Uploading cache.zip to https://<S3BUCKET>/1973-14-protected
Uploading cache 7.19 KB/7.19 KB (359.7 KB/s)
- Second job (adding more files to the cache path): new files are written to the cache path, but the upload is skipped because the runner reports "Archive is up to date!"
Checking cache for 1973-14-protected...
Downloading cache from https://<S3BUCKET>/1973-14-protected ETag="588a0a025ad24a41e9613ad9197b48b9"
Downloading cache 7.19 KB/7.19 KB (35.1 MB/s)
...
Creating cache 1973-14-protected...
.... found 3 matching artifact files and directories
Archive is up to date!
Created cache
- Third job:
Checking cache for 1973-14-protected...
Downloading cache from <S3BUCKET>/1973-14-protected ETag="588a0a025ad24a41e9613ad9197b48b9"
Downloading cache 7.19 KB/7.19 KB (35.1 MB/s)
- After a couple of retries, the third job finally downloads a fully populated cache:
Checking cache for 1973-14-protected...
Downloading cache from https://<S3BUCKET>/1973-14-protected ETag="1d1ef1d7b78d8568a184276f423061fd"
Downloading cache 254.10 MB/254.10 MB (58.0 MB/s)
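To confirm from the S3 side that the second job never replaced the archive, the cache object can be inspected directly. This is a sketch using the AWS CLI, with the bucket taken from config.toml and the key from the logs above; the <runner-prefix> portion of the key is a placeholder that depends on the runner configuration:

# Inspect the cache object; an unchanged ETag and LastModified after the second job
# would confirm that no new archive was uploaded.
aws s3api head-object \
  --bucket brpcache-glr-ntmto \
  --key "<runner-prefix>/1973-14-protected"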
Expected behavior
The second job should upload a new archive containing the additional matching files instead of reporting "Archive is up to date!".
Relevant logs and/or screenshots
Can be seen in the ticket.
Environment description
Docker executor runner on AWS with the S3 distributed cache enabled.
config.toml contents
concurrent = 10
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "cicdlp00011"
  url = "https://gitlab.com"
  id = 35068964
  token = "<placeholder>"
  token_obtained_at = 2024-04-22T19:39:02Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.aws]
    AssumeRoleARN = ""
  [runners.cache]
    MaxUploadedArchiveSize = 0
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com" # or your S3 endpoint
      AccessKey = "<placeholder>"
      SecretKey = "<placeholder>"
      BucketName = "brpcache-glr-ntmto"
      BucketLocation = "us-east-1"
      Insecure = false
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/home/gitlab-runner/cache:/cache"]
    shm_size = 0
    network_mtu = 0
Removing the local cache was tried; the issue still persists. Deleting the distributed cache object works as a workaround (the next run recreates it).
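A sketch of that workaround, assuming the same bucket as in config.toml and the cache key from the logs; the <runner-prefix> portion of the key is a placeholder that depends on the runner configuration:

# Delete the stale cache archive so the next pipeline run recreates it from scratch.
aws s3 rm "s3://brpcache-glr-ntmto/<runner-prefix>/1973-14-protected"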
Used GitLab Runner version
Running with gitlab-runner 18.3.1 (5a021a1c)
on cicdlp00013 Hb54rpvJY, system ID: s_1ffd0d9f7567
feature flags: FF_USE_FASTZIP:true
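It is not yet confirmed whether the FF_USE_FASTZIP flag listed above is related. As an isolation step, the flag can be disabled for a single pipeline using the standard feature-flag variable syntax (a debugging suggestion, not a confirmed fix):

variables:
  # Disable fastzip archiving for this pipeline only, to compare cache behavior.
  FF_USE_FASTZIP: "false"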