Treat `stuck_or_timeout_failure` and `job_execution_timeout` as `retry:when` aliases for the new specific failure reasons
## Summary

In %19.0, !230787 (closes #595752) split the generic `stuck_or_timeout_failure` and `job_execution_timeout` failure reasons into a set of more specific ones emitted by the various `Ci::StuckBuilds::*` and `Ci::TimedOutBuilds::*` services:

| Previous reason | New reasons |
|---|---|
| `stuck_or_timeout_failure` | `stuck_pending_with_matching_runners`, `stuck_pending_no_matching_runners`, `no_updates_running`, `no_updates_canceling` |
| `job_execution_timeout` | `server_timeout_running`, `server_timeout_canceling` |

The original enum values are preserved for historical data, but no new builds are written with them. See [#595752](https://gitlab.com/gitlab-org/gitlab/-/work_items/595752), !230787, and the docs follow-up !237556 for full context.

## Problem

The old reasons are valid values for [`retry:when`](https://docs.gitlab.com/ci/yaml/#retrywhen) in `.gitlab-ci.yml`. `Gitlab::Ci::Config::Entry::Retry::FullRetry.possible_retry_when_values` derives its allow-list from `Ci::Build.failure_reasons.keys`, so a config like:

```yaml
job:
  script: ./run.sh
  retry:
    max: 2
    when:
      - stuck_or_timeout_failure
      - job_execution_timeout
```

still **validates** after the upgrade to %19.0, but it silently **stops matching** any real failures, because the dropper services now write one of the six new reasons instead. Customers who relied on this retry behavior get a silent regression with no warning, no error, and no failed pipeline to alert them.

This was flagged as a likely breaking change for anyone consuming failure reasons via the [Jobs API](https://docs.gitlab.com/api/jobs/) or `retry:when`. See [this discussion on #595752](https://gitlab.com/gitlab-org/gitlab/-/work_items/595752#note_3250739522) and Drew's [follow-up note](https://gitlab.com/gitlab-org/gitlab/-/work_items/595752#note_3377814112).

We can't fully remove these values (they're in customer `.gitlab-ci.yml` files we don't control), but we also don't want them to mean "nothing" going forward.
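To make the regression concrete, here is a minimal Ruby sketch of the two checks involved. It is not GitLab code: `FAILURE_REASON_KEYS`, `valid_retry_when?`, and `retry_exact?` are hypothetical stand-ins, and the sketch assumes the retry logic compares the listed `retry:when` values to the build's failure reason by exact name.

```ruby
# Hypothetical stand-in for the enum keys: the legacy names are still present
# alongside the six new, specific reasons.
FAILURE_REASON_KEYS = %w[
  stuck_or_timeout_failure
  job_execution_timeout
  stuck_pending_with_matching_runners
  stuck_pending_no_matching_runners
  no_updates_running
  no_updates_canceling
  server_timeout_running
  server_timeout_canceling
].freeze

# Validation (assumed shape): every listed value must be a known enum key.
def valid_retry_when?(values)
  (values - FAILURE_REASON_KEYS).empty?
end

# Matching (assumed shape): retry only when the failure reason is listed verbatim.
def retry_exact?(retry_when, failure_reason)
  retry_when.include?(failure_reason)
end

legacy_config = %w[stuck_or_timeout_failure job_execution_timeout]

puts valid_retry_when?(legacy_config)                       # config still validates
puts retry_exact?(legacy_config, 'no_updates_running')      # but never matches a new reason
puts retry_exact?(legacy_config, 'server_timeout_running')
```

Under these assumptions the legacy config passes validation, yet neither of the two matching checks succeeds, which is the silent mismatch described above.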
## Proposal

Make `stuck_or_timeout_failure` and `job_execution_timeout` behave as **meta-reasons / aliases** in `retry:when` matching. When a user lists either of them under `retry:when`, the retry logic should match against the full set of new, specific reasons that replaced it:

- `stuck_or_timeout_failure` → matches a build that failed with **any** of:
  - `stuck_pending_with_matching_runners`
  - `stuck_pending_no_matching_runners`
  - `no_updates_running`
  - `no_updates_canceling`
- `job_execution_timeout` → matches a build that failed with **any** of:
  - `server_timeout_running`
  - `server_timeout_canceling`

This preserves the original semantic intent of these names ("retry me if I got stuck or timed out") for every existing config, without locking us into emitting the old reasons on new builds.

The implementation should live wherever `retry:when` matching is evaluated against a build's `failure_reason` (the auto-retry logic, not the YAML validator; the validator is already fine since the keys remain in the enum). A central alias map seems cleanest so it can be reused if we do this kind of split again.

### Deprecation warning

Alongside the alias behavior, we should warn users that these names are deprecated and they should migrate to the specific reasons:

- Emit a CI lint / config warning (non-blocking) when `stuck_or_timeout_failure` or `job_execution_timeout` appear under `retry:when`, pointing at the new reasons and the docs.
- The warning should be surfaced in the same places existing CI config warnings show up (pipeline editor, `/ci/lint`, the lint API response).
- Add a deprecation entry under `data/deprecations/` so this lands in the release post and gives self-managed customers lead time.

## Out of scope

- Removing `stuck_or_timeout_failure` / `job_execution_timeout` from the enum or from `possible_retry_when_values`. Both must stay valid for backward compatibility.
- Re-emitting the old reasons on new builds.
  The split in !230787 is intentional and we want the granular data; this issue is only about how `retry:when` interprets the old names.

## Acceptance criteria

- [ ] `retry:when: [stuck_or_timeout_failure]` triggers a retry when a build fails with any of `stuck_pending_with_matching_runners`, `stuck_pending_no_matching_runners`, `no_updates_running`, or `no_updates_canceling`.
- [ ] `retry:when: [job_execution_timeout]` triggers a retry when a build fails with any of `server_timeout_running` or `server_timeout_canceling`.
- [ ] The existing per-reason behavior is preserved: listing a specific new reason still only matches that reason.
- [ ] A non-blocking deprecation warning is shown in CI lint output / pipeline editor when these legacy reasons are used in `retry:when`.
- [ ] A deprecation notice is added under `data/deprecations/`.
- [ ] Docs in `doc/ci/yaml/_index.md` and `doc/ci/jobs/job_troubleshooting.md` (see !237556, !237605) are updated to describe the alias behavior and point users to the new reasons.
- [ ] Test coverage for the alias matching and the deprecation warning.

## References

- Original issue: [#595752](https://gitlab.com/gitlab-org/gitlab/-/work_items/595752)
- Implementation MR: !230787
- Docs MRs: !237556, !237605
- Drew's [to-do comment](https://gitlab.com/gitlab-org/gitlab/-/work_items/595752#note_3377814112) that initially raised bringing this back as a meta-reason for `retry:when`
- Customer-impact discussion: [note_3250739522 on #595752](https://gitlab.com/gitlab-org/gitlab/-/work_items/595752#note_3250739522)
- Relevant code: `lib/gitlab/ci/config/entry/retry.rb`, `app/models/concerns/enums/ci/commit_status.rb`, `app/services/ci/stuck_builds/`, `app/services/ci/timed_out_builds/`
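
## Appendix: illustrative sketch of alias matching

The central alias map and non-blocking warning from the proposal could look roughly like the Ruby sketch below. Every name here (`LEGACY_RETRY_WHEN_ALIASES`, `expand_retry_when`, `retry_matches?`, `deprecation_warnings`) is hypothetical and does not exist in the codebase; where this would actually plug into the auto-retry logic and the config warning machinery is left open. The sketch also assumes the legacy name itself stays matchable, so that builds already recorded with it behave as before.

```ruby
# Hypothetical central alias map: legacy retry:when name => specific reasons
# that replaced it in !230787.
LEGACY_RETRY_WHEN_ALIASES = {
  'stuck_or_timeout_failure' => %w[
    stuck_pending_with_matching_runners
    stuck_pending_no_matching_runners
    no_updates_running
    no_updates_canceling
  ],
  'job_execution_timeout' => %w[
    server_timeout_running
    server_timeout_canceling
  ]
}.freeze

# Expands legacy names into the specific reasons they stand for.
# The legacy name is kept in the result (assumption: it should still match
# historical builds that carry it).
def expand_retry_when(retry_when)
  retry_when.flat_map do |reason|
    [reason, *LEGACY_RETRY_WHEN_ALIASES.fetch(reason, [])]
  end.uniq
end

# True when the build's failure reason is listed directly or via an alias.
def retry_matches?(retry_when, failure_reason)
  expand_retry_when(retry_when).include?(failure_reason.to_s)
end

# Builds one non-blocking warning message per legacy name found in the config.
def deprecation_warnings(retry_when)
  (retry_when & LEGACY_RETRY_WHEN_ALIASES.keys).map do |legacy|
    replacements = LEGACY_RETRY_WHEN_ALIASES.fetch(legacy).join(', ')
    "retry:when value `#{legacy}` is deprecated; use one of: #{replacements}"
  end
end

puts retry_matches?(%w[stuck_or_timeout_failure], 'no_updates_canceling')
puts retry_matches?(%w[no_updates_running], 'no_updates_canceling')
puts deprecation_warnings(%w[job_execution_timeout])
```

Expanding at match time, rather than rewriting the stored config, keeps a specific new reason matching only itself while a legacy name matches its whole group, which lines up with the first three acceptance criteria.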