Zero-downtime upgrade causes pipeline job retries to fail with missing ref error

Summary

Pipeline job retries fail with the error fatal: couldn't find remote ref refs/pipelines/<ID> during zero-downtime upgrades from 17.11.6 to 18.0.

Steps to reproduce

1. Set up a GitLab HA environment running version 17.11.6.
2. Perform a zero-downtime upgrade to 18.0.4 following the recommended order (Rails nodes first, then Sidekiq nodes).
3. After the Rails nodes are upgraded but before the Sidekiq nodes are upgraded:
- Run a pipeline to completion
- Retry a job from the completed pipeline

What is the current bug behavior?

Job retries fail with error: fatal: couldn't find remote ref refs/pipelines/<ID>

What is the expected correct behavior?

Job retries should work normally during zero-downtime upgrades, regardless of the order in which nodes are upgraded.

Relevant logs and/or screenshots

fatal: couldn't find remote ref refs/pipelines/4740

Possible fixes

The issue stems from inconsistent behavior between Rails and Sidekiq nodes during the upgrade window:

Rails nodes (18.0.4) always run the code path behind the feature flag ci_only_one_persistent_ref_creation, because the flag was removed in 18.0.
Sidekiq nodes (17.11.6) still run with the feature flag disabled.
When a pipeline completes, ref cleanup in Sidekiq does not delete the pipeline_persistent_ref_cache_key Redis key, because that part of the cleanup code is gated behind the feature flag (https://gitlab.com/gitlab-org/gitlab/-/blob/v17.11.6-ee/app/services/ci/pipelines/clear_persistent_ref_service.rb#L7-L8).
On job retry, CreatePersistentRefService runs on a Rails node with version 18.0.4. It finds the stale Redis key and skips ref creation, even though the actual Git ref has already been cleaned up.
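
The sequence above can be illustrated with a small toy model. This is only a sketch, not GitLab's actual Ruby code: the function names, the cache key format, and the assumption that the flag-disabled path neither reads nor writes the Redis key are all hypothetical and chosen just to show how a stale cache key leads to the missing ref.

```python
# Toy model of the upgrade-window mismatch. NOT GitLab's real implementation:
# all names and the key format below are hypothetical.

def ref_name(pipeline_id):
    return f"refs/pipelines/{pipeline_id}"

def cache_key(pipeline_id):
    # Hypothetical key format; the real key layout is not shown in this issue.
    return f"pipeline_persistent_ref_cache_key:{pipeline_id}"

class Cluster:
    """State shared by all nodes: Git refs and Redis keys."""
    def __init__(self):
        self.git_refs = set()
        self.redis = set()

def create_persistent_ref(cluster, pipeline_id, flag_enabled):
    """Stands in for ref creation on a Rails node."""
    key = cache_key(pipeline_id)
    if flag_enabled and key in cluster.redis:
        return  # cache claims the ref exists, so creation is skipped
    cluster.git_refs.add(ref_name(pipeline_id))
    if flag_enabled:
        cluster.redis.add(key)

def clear_persistent_ref(cluster, pipeline_id, flag_enabled):
    """Stands in for ref cleanup on a Sidekiq node."""
    cluster.git_refs.discard(ref_name(pipeline_id))
    if flag_enabled:
        # Redis cleanup is gated behind the flag, as described above.
        cluster.redis.discard(cache_key(pipeline_id))

def fetch_ref(cluster, pipeline_id):
    """Stands in for the runner fetching the pipeline ref."""
    ref = ref_name(pipeline_id)
    if ref not in cluster.git_refs:
        raise LookupError(f"fatal: couldn't find remote ref {ref}")
    return ref

def run_and_retry(rails_flag, sidekiq_flag, pipeline_id=4740):
    cluster = Cluster()
    create_persistent_ref(cluster, pipeline_id, rails_flag)   # pipeline starts
    fetch_ref(cluster, pipeline_id)                           # first run works
    clear_persistent_ref(cluster, pipeline_id, sidekiq_flag)  # pipeline completes
    create_persistent_ref(cluster, pipeline_id, rails_flag)   # job retry
    try:
        fetch_ref(cluster, pipeline_id)
        return "ok"
    except LookupError as err:
        return str(err)

if __name__ == "__main__":
    print("both old:  ", run_and_retry(False, False))
    print("both new:  ", run_and_retry(True, True))
    print("mixed:     ", run_and_retry(True, False))
```

In this model, the retry succeeds when both node types agree on the flag and fails only in the mixed case (Rails with the flag behavior on, Sidekiq with it off), which matches the upgrade window described in the steps to reproduce.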

Environment

GitLab version: 17.11.6 → 18.0.4
Deployment type: HA (Multi-node)
Upgrade method: Zero-downtime upgrade

Additional context

This issue was introduced by MR !182565 (merged), which added reliance on a Redis cache for ref recreation. The feature flag ci_only_one_persistent_ref_creation, which gates use of the pipeline_persistent_ref_cache_key Redis key, was removed in 18.0 via MR !187651 (merged), creating the version mismatch during zero-downtime upgrades.

Edited Aug 18, 2025 by 🤖 GitLab Bot 🤖