
Zero-downtime upgrade causes pipeline job retries to fail with missing ref error

Summary

Pipeline job retries fail with the error fatal: couldn't find remote ref refs/pipelines/<ID> during zero-downtime upgrades from 17.11.6 to 18.0.

Steps to reproduce

  1. Set up a GitLab HA environment running version 17.11.6
  2. Perform a zero-downtime upgrade to 18.0.4 following the recommended order (Rails nodes first, then Sidekiq nodes)
  3. After Rails nodes are upgraded but before Sidekiq nodes are upgraded:
    • Run a pipeline to completion
    • Retry a job from the completed pipeline
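
The retry in step 3 can also be triggered through the GitLab Jobs API (POST /projects/:id/jobs/:job_id/retry) rather than the UI, which may make the reproduction easier to script during the upgrade window. The following is a minimal sketch, not part of the original report: the host, project path, job ID, and token are placeholders, and the helper name build_retry_request is hypothetical.

```python
import urllib.request
from urllib.parse import quote


def build_retry_request(base_url, project_id, job_id, token):
    """Build (but do not send) a POST request that retries one CI job."""
    # A project may be given as a numeric ID or as a path such as "group/project";
    # paths must be URL-encoded, so "/" becomes "%2F".
    project = quote(str(project_id), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project}/jobs/{job_id}/retry"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"PRIVATE-TOKEN": token},
    )


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with a real host, project, job, and token.
    req = build_retry_request("https://gitlab.example.com", "group/project", 123, "<token>")
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req)
```

The request is only built here, not sent, so the sketch can be checked without access to a GitLab instance. The missing-ref error itself would be expected in the retried job's log when the runner fetches the pipeline ref, not in the API response.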

What is the current bug behavior?

Job retries fail with error: fatal: couldn't find remote ref refs/pipelines/<ID>

What is the expected correct behavior?

Job retries should work normally during zero-downtime upgrades, regardless of the order in which nodes are upgraded.

Relevant logs and/or screenshots

fatal: couldn't find remote ref refs/pipelines/4740
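
To check how many retried jobs hit this error during an upgrade window, job logs can be scanned for the message above. The following is a small sketch, assuming the log text has already been downloaded as a string; the function name missing_pipeline_refs is hypothetical.

```python
import re

# Matches the Git error quoted in this report and captures the pipeline ID.
MISSING_REF = re.compile(r"fatal: couldn't find remote ref refs/pipelines/(\d+)")


def missing_pipeline_refs(log_text):
    """Return the pipeline IDs named in missing-ref errors, in order of appearance."""
    return [int(m.group(1)) for m in MISSING_REF.finditer(log_text)]


if __name__ == "__main__":
    sample = "fatal: couldn't find remote ref refs/pipelines/4740"
    print(missing_pipeline_refs(sample))  # [4740]
```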

Possible fixes

The issue stems from inconsistent behavior between Rails and Sidekiq nodes during the upgrade window, when Rails nodes already run 18.0.4 and Sidekiq nodes still run 17.11.6.

Environment

  • GitLab version: 17.11.6 → 18.0.4
  • Deployment type: HA (Multi-node)
  • Upgrade method: Zero-downtime upgrade

Additional context

This issue was introduced by MR !182565 (merged), which added a reliance on the Redis cache for ref recreation. The feature flag pipeline_persistent_ref_cache_key was removed in 18.0 via MR !187651 (merged), creating a version mismatch between nodes during zero-downtime upgrades.

Edited by 🤖 GitLab Bot 🤖