Treat `stuck_or_timeout_failure` and `job_execution_timeout` as retry:when aliases for the new specific failure reasons
Summary
In %19.0, !230787 (merged) (closes #595752 (closed)) split the generic stuck_or_timeout_failure and job_execution_timeout failure reasons into a set of more specific ones emitted by the various Ci::StuckBuilds::* and Ci::TimedOutBuilds::* services:
| Previous reason | New reasons |
|---|---|
| `stuck_or_timeout_failure` | `stuck_pending_with_matching_runners`, `stuck_pending_no_matching_runners`, `no_updates_running`, `no_updates_canceling` |
| `job_execution_timeout` | `server_timeout_running`, `server_timeout_canceling` |
The original enum values are preserved for historical data, but no new builds are written with them. See #595752 (closed), !230787 (merged), and the docs follow-up !237556 for full context.
Problem
The old reasons are valid values for retry:when in .gitlab-ci.yml. Gitlab::Ci::Config::Entry::Retry::FullRetry.possible_retry_when_values derives its allow-list from Ci::Build.failure_reasons.keys, so a config like:
    job:
      script: ./run.sh
      retry:
        max: 2
        when:
          - stuck_or_timeout_failure
          - job_execution_timeout

still validates after the upgrade to %19.0, but it silently stops matching any real failures, because the dropper services now write one of the six new reasons instead. Customers who relied on this retry behavior get a silent regression with no warning, no error, and no failed pipeline to alert them. This was flagged as a likely breaking change for anyone consuming failure reasons via the Jobs API or `retry:when`; see this discussion on #595752 and Drew's follow-up note.
We can't fully remove these values — they're in customer .gitlab-ci.yml files we don't control — but we also don't want them to mean "nothing" going forward.
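The silent mismatch can be illustrated with a minimal sketch. It assumes the auto-retry check is a plain membership test of the build's failure reason against the configured `retry:when` values; the helper name `exact_match?` is hypothetical and is not taken from the GitLab codebase.

```ruby
# Assumed, simplified matching rule: the failure reason must appear
# verbatim in the configured retry:when list.
def exact_match?(retry_when, failure_reason)
  retry_when.include?(failure_reason.to_s)
end

# The legacy config from the example above.
retry_when = %w[stuck_or_timeout_failure job_execution_timeout]

# The six reasons written by the dropper services since the split.
new_reasons = %w[
  stuck_pending_with_matching_runners
  stuck_pending_no_matching_runners
  no_updates_running
  no_updates_canceling
  server_timeout_running
  server_timeout_canceling
]

# Collect the new reasons that the legacy config would still retry on.
matched = new_reasons.select { |reason| exact_match?(retry_when, reason) }
puts matched.inspect # prints []
```

Under that assumption, none of the new reasons is ever matched, so the config stays valid but never triggers a retry.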
Proposal
Make stuck_or_timeout_failure and job_execution_timeout behave as meta-reasons / aliases in retry:when matching. When a user lists either of them under retry:when, the retry logic should match against the full set of new, specific reasons that replaced it:
- `stuck_or_timeout_failure` → matches a build that failed with any of:
  - `stuck_pending_with_matching_runners`
  - `stuck_pending_no_matching_runners`
  - `no_updates_running`
  - `no_updates_canceling`
- `job_execution_timeout` → matches a build that failed with any of:
  - `server_timeout_running`
  - `server_timeout_canceling`
This preserves the original semantic intent of these names ("retry me if I got stuck or timed out") for every existing config, without locking us into emitting the old reasons on new builds.
The implementation should live wherever retry:when matching is evaluated against a build's failure_reason (the auto-retry logic, not the YAML validator — the validator is already fine since the keys remain in the enum). A central alias map seems cleanest so it can be reused if we do this kind of split again.
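One possible shape for the central alias map is sketched below. This is an illustration under stated assumptions, not the actual implementation: the module name `RetryWhenAliases`, the constant `LEGACY_ALIASES`, and both method names are hypothetical, and the real matching code may be structured differently.

```ruby
# Hypothetical module; names are illustrative and not from the GitLab codebase.
module RetryWhenAliases
  # Legacy retry:when value => specific failure reasons that replaced it.
  LEGACY_ALIASES = {
    'stuck_or_timeout_failure' => %w[
      stuck_pending_with_matching_runners
      stuck_pending_no_matching_runners
      no_updates_running
      no_updates_canceling
    ],
    'job_execution_timeout' => %w[
      server_timeout_running
      server_timeout_canceling
    ]
  }.freeze

  # Expands the configured values so each legacy name also covers its
  # replacements. Non-legacy values pass through unchanged.
  def self.expand(retry_when)
    Array(retry_when).map(&:to_s).flat_map do |value|
      [value, *LEGACY_ALIASES.fetch(value, [])]
    end.uniq
  end

  # True when the failure reason is covered by the configured values,
  # either directly or through a legacy alias.
  def self.match?(retry_when, failure_reason)
    expand(retry_when).include?(failure_reason.to_s)
  end
end
```

With this shape, `RetryWhenAliases.match?(['stuck_or_timeout_failure'], 'no_updates_running')` is true, while `RetryWhenAliases.match?(['server_timeout_running'], 'server_timeout_canceling')` stays false, so specific reasons keep their exact-match behavior. Keeping the legacy name itself in the expanded list means historical builds that carry the old reason would still match.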
Deprecation warning
Alongside the alias behavior, we should warn users that these names are deprecated and they should migrate to the specific reasons:
- Emit a CI lint / config warning (non-blocking) when `stuck_or_timeout_failure` or `job_execution_timeout` appear under `retry:when`, pointing at the new reasons and the docs.
- The warning should be surfaced in the same places existing CI config warnings show up (pipeline editor, `/ci/lint`, the lint API response).
- Add a deprecation entry under `data/deprecations/` so this lands in the release post and gives self-managed customers lead time.
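The warning text itself could be generated along these lines. This is a rough sketch only: the module `RetryWhenDeprecation`, the constant `REPLACEMENTS`, the method `warnings_for`, and the message wording are all hypothetical, and how the messages are attached to the config entry or lint response is left out.

```ruby
# Hypothetical helper; names and message wording are illustrative only.
module RetryWhenDeprecation
  # Legacy retry:when value => specific reasons users should migrate to.
  REPLACEMENTS = {
    'stuck_or_timeout_failure' => %w[
      stuck_pending_with_matching_runners
      stuck_pending_no_matching_runners
      no_updates_running
      no_updates_canceling
    ],
    'job_execution_timeout' => %w[
      server_timeout_running
      server_timeout_canceling
    ]
  }.freeze

  # Returns one non-blocking warning per legacy value found in retry:when.
  # Accepts a single value or a list; values without a replacement are skipped.
  def self.warnings_for(retry_when)
    Array(retry_when).map(&:to_s).uniq.filter_map do |value|
      replacements = REPLACEMENTS[value]
      next unless replacements

      "retry:when value `#{value}` is deprecated; " \
        "use #{replacements.join(', ')} instead"
    end
  end
end
```

A config that lists only the new specific reasons would produce an empty list, so the warning appears only for configs that still use the legacy names.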
Out of scope
- Removing `stuck_or_timeout_failure` / `job_execution_timeout` from the enum or from `possible_retry_when_values`. Both must stay valid for backward compatibility.
- Re-emitting the old reasons on new builds. The split in !230787 (merged) is intentional and we want the granular data; this issue is only about how `retry:when` interprets the old names.
Acceptance criteria
- `retry:when: [stuck_or_timeout_failure]` triggers a retry when a build fails with any of `stuck_pending_with_matching_runners`, `stuck_pending_no_matching_runners`, `no_updates_running`, or `no_updates_canceling`.
- `retry:when: [job_execution_timeout]` triggers a retry when a build fails with any of `server_timeout_running` or `server_timeout_canceling`.
- The existing per-reason behavior is preserved: listing a specific new reason still only matches that reason.
- A non-blocking deprecation warning is shown in CI lint output / pipeline editor when these legacy reasons are used in `retry:when`.
- A deprecation notice is added under `data/deprecations/`.
- Docs in `doc/ci/yaml/_index.md` and `doc/ci/jobs/job_troubleshooting.md` (see !237556, !237605) are updated to describe the alias behavior and point users to the new reasons.
- Test coverage for the alias matching and the deprecation warning.
References
- Original issue: #595752 (closed)
- Implementation MR: !230787 (merged)
- Docs MRs: !237556, !237605
- Drew's to-do comment that initially raised bringing this back as a meta-reason for `retry:when`
- Customer-impact discussion: note_3250739522 on #595752
- Relevant code: `lib/gitlab/ci/config/entry/retry.rb`, `app/models/concerns/enums/ci/commit_status.rb`, `app/services/ci/stuck_builds/`, `app/services/ci/timed_out_builds/`