Cache content corrupted in subsequent runs once cache-extractor is killed
When the cache extractor is killed due to OOM (related to gitlab-org/gitlab-runner#27984 (closed)), subsequent retries of the same job that successfully download and extract the cache still fail due to cache corruption.
## Setup
- GitLab Runner 15.9, 16.1, 16.2, 16.3
- Kubernetes executor
- no volumes mounted into the build dir (only configs are mounted)
- `CACHE_TYPE: 's3'`
- `CACHE_PATH`: default
- `CACHE_SHARED: true`
- `FF_USE_FASTZIP: true`
We also tried disabling `FF_USE_FASTZIP`, but it did not help.
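For context, a rough sketch of where these settings live when the runner is deployed with the official Helm chart; the bucket name, region, and server address below are illustrative placeholders, not our real values:

```yaml
# values.yaml fragment for the GitLab Runner Helm chart (sketch only)
runners:
  config: |
    [[runners]]
      executor = "kubernetes"
      environment = ["FF_USE_FASTZIP=true"]
      [runners.cache]
        Type = "s3"
        Path = ""        # default
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "example-runner-cache"
          BucketLocation = "eu-central-1"
```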
## Details
We're caching our Maven directory for a Java application, so the cache ZIP contains JAR files. When the cache extractor is killed in the middle of extraction, some of the extracted JAR files are incomplete. The cache is not re-uploaded in that case.
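For reference, the job-level caching is along these lines (a minimal sketch, not our exact pipeline; the key, image, and paths are illustrative):

```yaml
# .gitlab-ci.yml sketch: cache the local Maven repository between job runs
variables:
  # keep the Maven repo inside the project dir so the runner can cache it
  MAVEN_OPTS: "-Dmaven.repo.local=$CI_PROJECT_DIR/.m2/repository"

build:
  image: maven:3.9-eclipse-temurin-17
  cache:
    key: "maven-$CI_COMMIT_REF_SLUG"
    paths:
      - .m2/repository
  script:
    - mvn --batch-mode package
```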
The ZIP file itself is valid on S3: we have successfully downloaded it and were able to build the application with it. In the meantime, the same job with the same cache key also completes just fine on other GitLab runners.
The issue is that subsequent runs of the same job on the same runner report "Successful extraction", yet the build fails because of corrupted cache content.
Unfortunately, we were unable to discover the real reason behind this behavior.
- We're sure that the content on S3 is valid (successful build)
- Other jobs on different runners complete successfully with the same cache
- There is no explicit volume mount in our configuration.
- It's always the same pattern: a killed cache extraction -> subsequent runs of the job on that runner fail
Ideas: Perhaps the content of the ZIP file downloaded via the presigned S3 URL is cached on some HTTP layer? (We download it directly from S3, without any CDN.) Or does the runner use some hidden local storage caching?
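To check the HTTP-layer hypothesis, a debug job along these lines can be run next to the failing job (a sketch only: the bucket, object key, and the availability of the AWS CLI, `curl`, and `unzip` in the job image are assumptions):

```yaml
verify-cache-archive:
  script:
    # Generate a presigned URL for the cache object, like the runner does for S3 caches
    - URL=$(aws s3 presign "s3://example-runner-cache/project/default/cache.zip")
    # Download it and keep the response headers
    - curl -sSL -D headers.txt -o cache.zip "$URL"
    # Any Via / X-Cache / Age headers would hint at an intermediate HTTP cache
    - grep -iE '^(via|x-cache|age):' headers.txt || echo "no proxy/CDN cache headers seen"
    # Fails if any member of the archive is damaged
    - unzip -tq cache.zip
```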
We managed to overcome this issue by:
- raising memory limits (not a 100% reliable fix; perhaps gitlab-org/gitlab-runner!4312 (merged) should fix it)
- performing both compression and decompression ourselves (see the sketch after this list)
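For the first workaround, the relevant knob on the Kubernetes executor is presumably the helper container's memory limit (e.g. `helper_memory_limit` in the `[runners.kubernetes]` section), since cache extraction runs in the helper container. For the second, the sketch below shows roughly what we mean by handling the archive ourselves; the bucket, archive name, and the availability of `tar` and the AWS CLI in the job image are assumptions, not our exact setup:

```yaml
build:
  variables:
    CACHE_ARCHIVE: "s3://example-runner-cache/maven-$CI_COMMIT_REF_SLUG.tar.gz"
  before_script:
    # Restore: wipe any partially extracted tree left over from a killed run,
    # then download and unpack the archive if it exists
    - rm -rf .m2/repository
    - (aws s3 cp "$CACHE_ARCHIVE" cache.tar.gz && tar -xzf cache.tar.gz) || echo "no cache yet"
  script:
    - mvn --batch-mode package
  after_script:
    # Save: rebuild the archive from scratch so a broken tree is never re-uploaded
    - tar -czf cache.tar.gz .m2/repository
    - aws s3 cp cache.tar.gz "$CACHE_ARCHIVE"
```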
Unfortunately, we do not have a good public reproducer and were not able to discover the root cause of this weird behavior.
## Disclaimer
It is possible that this issue is relevant only to our setup, but we spent a significant amount of time tracking down the cause and decided to share the issue with the community.