Fix assets caching in scheduled cache-assets:production job
What does this MR do and why?
Background
One of the optimizations for the build and deploy process is to cache assets as a generic package that can then be consumed by the build process.
Assets in this context refers to frontend assets built by the
gitlab:assets:compile rake task, which calls out to yarn. We compute a
cached-assets-hash over all frontend files. If none of these source
files changed, the build can reuse the previously compiled assets and
save approximately 40 minutes of build time.
The way this process is intended to work is via a scheduled pipeline on
gitlab-org/gitlab that runs every 2 hours. It checks the
cached-assets-hash and, if no package exists for that hash, builds an
assets package and publishes it to the package registry on gitlab-org/gitlab.
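The check-then-publish flow described above can be sketched roughly as follows. This is an illustration only, not the actual job script: the function names are hypothetical, `build_and_publish` is a placeholder for the real compile and upload steps, and it assumes that a successful HEAD request on the package URL means the package already exists.

```shell
# Assumption: a successful HEAD request means the package is already published.
package_exists() {
  curl --fail --silent --head --output /dev/null "$1"
}

# Hypothetical placeholder for compiling the assets and uploading the archive.
build_and_publish() {
  echo "would build and publish: $1"
}

# Skip the expensive build when a package for this URL already exists.
cache_assets() {
  if package_exists "$1"; then
    echo "package already exists, skipping: $1"
  else
    build_and_publish "$1"
  fi
}
```

Because the package URL embeds the hash, the skip decision is only as trustworthy as the hash that went into the URL.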
The bug
This logic was introduced by !96297 (merged). It was most recently updated by !179950 (merged).
That MR introduced a subtle bug: by changing the order of setting
$GITLAB_ASSETS_HASH and including
scripts/gitlab_component_helpers.sh, that helper library is no longer
able to consume $GITLAB_ASSETS_HASH and instead defaults to the
string "NO_HASH".
There is no logic to fail when no hash is supplied, so we compute
a package URL containing the string NO_HASH. The job then publishes a
package to that URL, and on the next run it skips re-compiling
assets, because a package is already present under NO_HASH.
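The ordering problem can be reproduced in isolation with a minimal sketch. `load_helpers` below is a hypothetical stand-in for sourcing scripts/gitlab_component_helpers.sh, the variable `ASSETS_PACKAGE` and the hash value `abc123` are made up for illustration, and the real helper may build its names differently. The point is only that a default applied at load time is not revisited when the variable is set later.

```shell
# Hypothetical stand-in for sourcing the helper library: it reads the hash
# once, at load time, and falls back to "NO_HASH" when the variable is unset.
load_helpers() {
  ASSETS_PACKAGE="assets-production-ee-${GITLAB_ASSETS_HASH:-NO_HASH}"
}

# Broken order: helpers are loaded before the hash is set.
broken_package="$(
  unset GITLAB_ASSETS_HASH
  load_helpers
  GITLAB_ASSETS_HASH="abc123"   # too late, the package name is already fixed
  echo "$ASSETS_PACKAGE"
)"

# Original order: the hash is set first, then the helpers are loaded.
fixed_package="$(
  GITLAB_ASSETS_HASH="abc123"
  load_helpers
  echo "$ASSETS_PACKAGE"
)"

echo "broken: $broken_package"
echo "fixed:  $fixed_package"
```

Each variant runs in its own subshell so that neither leaks state into the other.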
We can see that behaviour here.
The current cached assets package is 9 days old:
➜  ~ curl -I https://gitlab.com/api/v4/projects/278964/packages/generic/assets/production-ee-NO_HASH/assets-production-ee-NO_HASH-v2.tar.gz
last-modified: Wed, 05 Feb 2025 22:06:11 GMT
Bug impact
The saving grace is that this bug was only introduced for the scheduled job, and not for the jobs consuming that cache. Thus we avoid building and deploying omnibus packages or CNG images which contain a stale cache. We got lucky here.
The only real consequence is that we no longer get any cache hits, so the build process will always need to rebuild assets, even if none changed. This was surfaced as part of gitlab-com/gl-infra/production#19280 (closed).
The fix
This patch fixes the bug by restoring the original order. This allows the cache-assets:production job to produce valid assets cache packages again, which speeds up builds and deploys when no assets have changed. That is crucial for rolling forward urgent fixes, as it cuts about 40 minutes from time-to-production.
Further considerations
Additional measures we should consider for more safety:
- Check for NO_HASH and bail out.
- After downloading an assets archive, validate the contained cached-assets-hash against the one from the filesystem.
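
Both measures could look roughly like the sketch below. It is not part of this MR: the function names are hypothetical, and how the hash is read from a downloaded archive is left out because it depends on the archive layout.

```shell
# Hypothetical guard: refuse to continue when no real hash was supplied.
assert_assets_hash_present() {
  if [ -z "${1:-}" ] || [ "$1" = "NO_HASH" ]; then
    echo "ERROR: no assets hash available, refusing to build a package URL" >&2
    return 1
  fi
}

# Hypothetical check after download: the hash recorded in the archive ($1)
# must be non-empty and equal to the hash computed from the filesystem ($2).
assets_hash_matches() {
  [ -n "${1:-}" ] && [ "$1" = "${2:-}" ]
}
```

A job script could call `assert_assets_hash_present "${GITLAB_ASSETS_HASH:-}" || exit 1` before computing any package URL, so a missing hash fails the job instead of silently publishing under NO_HASH.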
References
- Original implementation: ci: Download assets from generic package (!96297 - merged)
- Change that introduced the bug: Update cache-assets-base and cache-workhorse to... (!179950 - merged)
- Investigation that surfaced the bug: gitlab-com/gl-infra/production#19280 (comment 2347066074).
Kudos to @skarbek for highlighting this!
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
n/a
How to set up and validate locally
n/a, we will need to test it in the context of all of the pipelines.
We can look at the pipeline schedules for the next run of [2-hourly] [maintenance] Full test run, Repo caching, Review Apps cleanup, Caches update. It should contain a cache-assets:production job, and that job should not be trying to download from NO_HASH. See broken example.