Treat `stuck_or_timeout_failure` and `job_execution_timeout` as retry:when aliases for the new specific failure reasons

Summary

In %19.0, !230787 (merged) (closes #595752 (closed)) split the generic stuck_or_timeout_failure and job_execution_timeout failure reasons into a set of more specific ones emitted by the various Ci::StuckBuilds::* and Ci::TimedOutBuilds::* services:

| Previous reason | New reasons |
| --- | --- |
| stuck_or_timeout_failure | stuck_pending_with_matching_runners, stuck_pending_no_matching_runners, no_updates_running, no_updates_canceling |
| job_execution_timeout | server_timeout_running, server_timeout_canceling |

The original enum values are preserved for historical data, but no new builds are written with them. See #595752 (closed), !230787 (merged), and the docs follow-up !237556 for full context.

Problem

The old reasons are valid values for retry:when in .gitlab-ci.yml. Gitlab::Ci::Config::Entry::Retry::FullRetry.possible_retry_when_values derives its allow-list from Ci::Build.failure_reasons.keys, so a config like:

job:
  script: ./run.sh
  retry:
    max: 2
    when:
      - stuck_or_timeout_failure
      - job_execution_timeout

still validates after the upgrade to %19.0, but it silently stops matching any real failures, because the dropper services now write one of the six new reasons instead. Customers who relied on this retry behavior get a silent regression: no warning, no error, and no failed pipeline to alert them. This was flagged as a likely breaking change for anyone consuming failure reasons via the Jobs API or retry:when; see the discussion on #595752 and drew's follow-up note.

We can't fully remove these values — they're in customer .gitlab-ci.yml files we don't control — but we also don't want them to mean "nothing" going forward.
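
The regression can be reproduced in miniature. The following is a minimal sketch, not GitLab's actual code: FAILURE_REASON_KEYS, valid_retry_when? and retry_matches? are hypothetical stand-ins, and it assumes that matching is a literal membership check of the build's failure reason against the configured list.

```ruby
# Illustrative stand-ins only; these are not GitLab's actual classes or methods.
# Subset of the failure-reason enum keys: the legacy names stay, the new ones are added.
FAILURE_REASON_KEYS = %w[
  stuck_or_timeout_failure
  job_execution_timeout
  stuck_pending_with_matching_runners
  stuck_pending_no_matching_runners
  no_updates_running
  no_updates_canceling
  server_timeout_running
  server_timeout_canceling
].freeze

# Validation: every retry:when entry must be a known enum key.
def valid_retry_when?(retry_when)
  retry_when.all? { |reason| FAILURE_REASON_KEYS.include?(reason) }
end

# Assumed matching rule: literal membership of the build's failure reason.
def retry_matches?(retry_when, failure_reason)
  retry_when.include?(failure_reason)
end

legacy_config = %w[stuck_or_timeout_failure job_execution_timeout]

puts valid_retry_when?(legacy_config)                         # true
puts retry_matches?(legacy_config, "no_updates_running")      # false
puts retry_matches?(legacy_config, "server_timeout_running")  # false
```

The config passes validation because the legacy keys are still in the enum, yet neither entry can ever equal a reason that is written to new builds.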

Proposal

Make stuck_or_timeout_failure and job_execution_timeout behave as meta-reasons / aliases in retry:when matching. When a user lists either of them under retry:when, the retry logic should match against the full set of new, specific reasons that replaced it:

  • stuck_or_timeout_failure → matches a build that failed with any of:
    • stuck_pending_with_matching_runners
    • stuck_pending_no_matching_runners
    • no_updates_running
    • no_updates_canceling
  • job_execution_timeout → matches a build that failed with any of:
    • server_timeout_running
    • server_timeout_canceling

This preserves the original semantic intent of these names ("retry me if I got stuck or timed out") for every existing config, without locking us into emitting the old reasons on new builds.

The implementation should live wherever retry:when matching is evaluated against a build's failure_reason (the auto-retry logic, not the YAML validator — the validator is already fine since the keys remain in the enum). A central alias map seems cleanest so it can be reused if we do this kind of split again.
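
As a starting point for discussion, the central alias map could look roughly like the sketch below. The module name RetryWhenAliases and its methods expand and match? are hypothetical, and where this would plug into the auto-retry logic is left open; only the reason names come from this issue.

```ruby
# Hypothetical module; the name, methods, and location are assumptions.
module RetryWhenAliases
  # Legacy retry:when value => specific failure reasons that replaced it.
  ALIASES = {
    "stuck_or_timeout_failure" => %w[
      stuck_pending_with_matching_runners
      stuck_pending_no_matching_runners
      no_updates_running
      no_updates_canceling
    ],
    "job_execution_timeout" => %w[
      server_timeout_running
      server_timeout_canceling
    ]
  }.freeze

  # Expands each configured reason into itself plus any reasons it aliases.
  def self.expand(retry_when)
    retry_when.flat_map { |reason| [reason, *ALIASES.fetch(reason, [])] }.uniq
  end

  # True when the build's failure reason is covered by the configured list.
  def self.match?(retry_when, failure_reason)
    expand(retry_when).include?(failure_reason.to_s)
  end
end

RetryWhenAliases.match?(%w[stuck_or_timeout_failure], "no_updates_canceling")   # => true
RetryWhenAliases.match?(%w[job_execution_timeout], "no_updates_running")        # => false
RetryWhenAliases.match?(%w[server_timeout_running], "server_timeout_canceling") # => false
```

Keeping the legacy name itself in the expanded set means a build that still carries the old reason continues to match, and a specific new reason expands only to itself, so per-reason behavior is unchanged.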

Deprecation warning

Alongside the alias behavior, we should warn users that these names are deprecated and they should migrate to the specific reasons:

  • Emit a CI lint / config warning (non-blocking) when stuck_or_timeout_failure or job_execution_timeout appear under retry:when, pointing at the new reasons and the docs.
  • The warning should be surfaced in the same places existing CI config warnings show up (pipeline editor, /ci/lint, the lint API response).
  • Add a deprecation entry under data/deprecations/ so this lands in the release post and gives self-managed customers lead time.
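
The warning check itself could be as small as the sketch below. The constant DEPRECATED_RETRY_WHEN_REASONS, the method retry_when_deprecation_warnings, and the message wording are all hypothetical; how the returned strings would be attached to the existing CI config warnings is not shown. The replacement lists mirror the Proposal section.

```ruby
# Hypothetical helper; the constant, method name, and message format are assumptions.
DEPRECATED_RETRY_WHEN_REASONS = {
  "stuck_or_timeout_failure" => %w[
    stuck_pending_with_matching_runners
    stuck_pending_no_matching_runners
    no_updates_running
    no_updates_canceling
  ],
  "job_execution_timeout" => %w[
    server_timeout_running
    server_timeout_canceling
  ]
}.freeze

# Returns one warning string per deprecated reason found under retry:when.
# It never raises, so the config stays valid (the warning is non-blocking).
def retry_when_deprecation_warnings(job_name, retry_when)
  Array(retry_when).filter_map do |reason|
    replacements = DEPRECATED_RETRY_WHEN_REASONS[reason.to_s]
    next unless replacements

    "jobs:#{job_name}:retry:when `#{reason}` is deprecated; " \
      "use #{replacements.join(', ')} instead"
  end
end

puts retry_when_deprecation_warnings("job", %w[job_execution_timeout no_updates_running])
# jobs:job:retry:when `job_execution_timeout` is deprecated; use server_timeout_running, server_timeout_canceling instead
```

Returning an array of strings rather than raising keeps the check independent of validation, which matches the requirement that both legacy values remain valid.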

Out of scope

  • Removing stuck_or_timeout_failure / job_execution_timeout from the enum or from possible_retry_when_values. Both must stay valid for backward compatibility.
  • Re-emitting the old reasons on new builds. The split in !230787 (merged) is intentional and we want the granular data; this issue is only about how retry:when interprets the old names.

Acceptance criteria

  • retry:when: [stuck_or_timeout_failure] triggers a retry when a build fails with any of stuck_pending_with_matching_runners, stuck_pending_no_matching_runners, no_updates_running, or no_updates_canceling.
  • retry:when: [job_execution_timeout] triggers a retry when a build fails with either server_timeout_running or server_timeout_canceling.
  • The existing per-reason behavior is preserved — listing a specific new reason still only matches that reason.
  • A non-blocking deprecation warning is shown in CI lint output / pipeline editor when these legacy reasons are used in retry:when.
  • A deprecation notice is added under data/deprecations/.
  • Docs in doc/ci/yaml/_index.md and doc/ci/jobs/job_troubleshooting.md (see !237556, !237605) are updated to describe the alias behavior and point users to the new reasons.
  • Test coverage for the alias matching and the deprecation warning.
