Zero-downtime upgrade causes pipeline job retries to fail with missing ref error
Summary
Pipeline job retries fail with the error fatal: couldn't find remote ref refs/pipelines/<ID> during zero-downtime upgrades from 17.11.6 to 18.0.
Steps to reproduce
- Set up a GitLab HA environment running version 17.11.6
- Perform a zero-downtime upgrade to 18.0.4 following the recommended order (Rails nodes first, then Sidekiq nodes)
- After the Rails nodes are upgraded but before the Sidekiq nodes are upgraded:
  - Run a pipeline to completion
  - Retry a job from the completed pipeline
What is the current bug behavior?
Job retries fail with error: fatal: couldn't find remote ref refs/pipelines/<ID>
What is the expected correct behavior?
Job retries should work normally during zero-downtime upgrades, regardless of the order in which nodes are upgraded.
Relevant logs and/or screenshots
fatal: couldn't find remote ref refs/pipelines/4740
Possible fixes
The issue stems from inconsistent behavior between Rails and Sidekiq nodes during the upgrade window:
- Rails nodes (18.0.4) have the ci_only_one_persistent_ref_creation behavior permanently enabled (the feature flag was removed)
- Sidekiq nodes (17.11.6) run with the feature flag disabled
- When a pipeline completes, ref cleanup in Sidekiq doesn't delete the pipeline_persistent_ref_cache_key Redis key because the cleanup code is gated behind the feature flag (https://gitlab.com/gitlab-org/gitlab/-/blob/v17.11.6-ee/app/services/ci/pipelines/clear_persistent_ref_service.rb#L7-L8)
- On job retry, CreatePersistentRefService runs in Rails on version 18.0.4. It finds the Redis key and skips ref creation, but the actual Git ref has already been cleaned up, so fetching refs/pipelines/<ID> fails.
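The sequence above can be reproduced in a small toy model. This is only a sketch of the behavior described in this issue, not GitLab code: the Cluster class, the function names, and the key format are hypothetical stand-ins for Gitaly refs, Redis, CreatePersistentRefService, and the 17.11.6 cleanup service.

```python
# Toy model of the mixed-version window. Every name here is hypothetical
# and only mirrors the behavior described in this issue.

CACHE_KEY = "pipeline_persistent_ref_cache_key"  # stand-in for the Redis key


class Cluster:
    """State shared by Rails and Sidekiq nodes."""

    def __init__(self):
        self.git_refs = set()    # stand-in for refs stored in Git
        self.redis_keys = set()  # stand-in for Redis


def create_persistent_ref(cluster, pipeline_id):
    """Rails 18.0.4 behavior: trust the Redis key and skip if it exists."""
    key = f"{CACHE_KEY}:{pipeline_id}"  # assumed key format
    if key in cluster.redis_keys:
        return "skipped"
    cluster.git_refs.add(f"refs/pipelines/{pipeline_id}")
    cluster.redis_keys.add(key)
    return "created"


def clear_persistent_ref(cluster, pipeline_id, flag_enabled):
    """Cleanup: the Git ref is always removed, the Redis key only
    when the feature flag is enabled on the node doing the cleanup."""
    cluster.git_refs.discard(f"refs/pipelines/{pipeline_id}")
    if flag_enabled:
        cluster.redis_keys.discard(f"{CACHE_KEY}:{pipeline_id}")


def fetch_ref(cluster, pipeline_id):
    """What the retried job needs: the ref must exist to be fetched."""
    ref = f"refs/pipelines/{pipeline_id}"
    if ref not in cluster.git_refs:
        raise LookupError(f"fatal: couldn't find remote ref {ref}")
    return ref


def retry_after_completion(sidekiq_flag_enabled, pipeline_id=4740):
    cluster = Cluster()
    create_persistent_ref(cluster, pipeline_id)  # pipeline starts (Rails)
    clear_persistent_ref(cluster, pipeline_id, sidekiq_flag_enabled)  # pipeline completes (Sidekiq)
    create_persistent_ref(cluster, pipeline_id)  # job retry (Rails)
    try:
        return fetch_ref(cluster, pipeline_id)
    except LookupError as err:
        return str(err)


print(retry_after_completion(sidekiq_flag_enabled=False))  # mixed versions
print(retry_after_completion(sidekiq_flag_enabled=True))   # consistent versions
```

With the flag disabled on the cleanup side, the stale key survives and the retry skips ref creation, which matches the reported error. With the flag enabled on both sides, the key is cleared and the ref is recreated on retry.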
Environment
- GitLab version: 17.11.6 → 18.0.4
- Deployment type: HA (Multi-node)
- Upgrade method: Zero-downtime upgrade
Additional context
This issue was introduced by MR !182565 (merged), which added reliance on the Redis cache for ref recreation. The feature flag ci_only_one_persistent_ref_creation was removed in 18.0 via MR !187651 (merged), creating the version mismatch scenario during zero-downtime upgrades.