Clarify and differentiate stuck_or_timeout job failures
## Problem

Currently, jobs that fail due to being stuck or timing out are classified with a single failure reason: `stuck_or_timeout_failure`. This blanket classification obscures important distinctions about the job's state when it was terminated, making it difficult to:

- Understand whether a job was actually stuck (no activity) or simply exceeded a timeout threshold
- Differentiate between jobs that timed out due to resource constraints vs. infrastructure issues
- Provide users with actionable feedback about why their job failed
- Build accurate observability and alerting around job execution failures
- Properly categorize these failures in the broader job failure reason standardization effort ([#595703](https://gitlab.com/gitlab-org/gitlab/-/work_items/595703))

## Current State

The `stuck_or_timeout_failure` reason is applied by `StuckCiJobsWorker` when it terminates jobs that have been running too long or appear stuck. However, there's no distinction in the failure reason about:

- What state the job was in when it was selected for termination
- Whether it was actually stuck (no activity) or simply exceeded a time limit
- What specific timeout or stuck condition triggered the failure

## Desired Outcomes

1. **Clarified failure reasons**: Any build currently failing as "stuck or timed out" should have a failure reason that clarifies what its state was when it was selected for drop/failure. This includes:
   - Jobs that exceeded a maximum execution time
   - Jobs that showed no activity for an extended period
   - Jobs that were stuck in a specific state (e.g., waiting for runner, waiting for artifacts)
2. **Enhanced CI::Build failure reasons**: Add new failure reasons to `Ci::Build` that explicitly indicate when a job was failed by `StuckCiJobsWorker`, with sufficient detail to understand the termination reason.

## Technical Implementation

**Mapping to new Failure Reasons**

The existing `stuck_or_timeout_failure` reason is replaced based on the specific service that drops the job:

| Service | Previous Reason | New Reason | Timing | Description |
|---------|-----------------|------------|------|-------------|
| `Ci::StuckBuilds::DropPendingService` (outdated timeout) | `stuck_or_timeout_failure` | `stuck_pending_with_matching_runners` | 24h in pending | Job was stuck in pending state even though matching runners were available. Indicates potential job configuration issues or transient infrastructure problems. |
| `Ci::StuckBuilds::DropPendingService` (stuck timeout, from queue) | `stuck_or_timeout_failure` | `stuck_pending_no_matching_runners` | 1h in pending | Job was stuck in pending state because no runners matched the job's requirements (tags, protected status, etc.). Indicates runner configuration mismatch. |
| `Ci::StuckBuilds::DropRunningService` | `stuck_or_timeout_failure` | `no_updates_running` | 1h of Runner silence | Job was in running state but showed no activity for an extended period. Indicates the job script may have hung or the runner lost connectivity. |
| `Ci::StuckBuilds::DropCancelingService` | `stuck_or_timeout_failure` | `no_updates_canceling` | 1h of Runner silence | Job was in canceling state but showed no activity for an extended period. The cancellation process stalled. |
| `Ci::TimedOutBuilds::DropRunningService` | `job_execution_timeout` | `server_timeout_running` | Configured timeout + 15m | Running job exceeded the maximum execution time configured for the job. Server-side enforcement of job timeout. |
| `Ci::TimedOutBuilds::DropCancelingService` | `job_execution_server_timeout` | `server_timeout_canceling` | Configured timeout + 15m | Canceling job exceeded the maximum execution time. Server-side enforcement during cancellation. |

**Backward Compatibility**

The existing `stuck_or_timeout_failure` enum value (ID 4) is preserved in the codebase for historical data compatibility. New jobs will use the more specific failure reasons going forward.

## Acceptance Criteria

- [x] Audit current usage of `stuck_or_timeout_failure` in the codebase
- [ ] Define new, mutually exclusive failure reasons that replace `stuck_or_timeout_failure`
- [ ] Update `StuckCiJobsWorker` to use the new, more specific failure reasons based on job state
- [ ] Document the new failure reasons and when each should be applied
- [ ] Update any related observability/dashboards to reflect the new failure reasons
- [ ] Ensure changes align with the broader job failure reason standardization in [#595703](https://gitlab.com/gitlab-org/gitlab/-/work_items/595703)

## Related Work

This proposal is part of the broader effort to standardize and clarify CI job failure reasons: [#595703](https://gitlab.com/gitlab-org/gitlab/-/work_items/595703)
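
## Illustrative Sketch

To make the mapping table under Technical Implementation concrete, here is a minimal plain-Ruby sketch of how each dropping service could resolve to exactly one new failure reason. It is not the actual GitLab implementation. The service names and reason identifiers come from the table; everything else (`NEW_FAILURE_REASONS`, `failure_reason_for`, `mutually_exclusive?`, `server_timeout_exceeded?`, the `:outdated_timeout` / `:stuck_timeout` variant symbols, and measuring the server timeout from the job's start time) is a hypothetical name or assumption introduced for illustration.

```ruby
# Hypothetical lookup: [service name, variant] => new failure reason.
# Service and reason identifiers are taken from the mapping table; the
# constant name and the variant symbols are assumptions for illustration.
NEW_FAILURE_REASONS = {
  ['Ci::StuckBuilds::DropPendingService', :outdated_timeout] => :stuck_pending_with_matching_runners,
  ['Ci::StuckBuilds::DropPendingService', :stuck_timeout]    => :stuck_pending_no_matching_runners,
  ['Ci::StuckBuilds::DropRunningService', nil]               => :no_updates_running,
  ['Ci::StuckBuilds::DropCancelingService', nil]             => :no_updates_canceling,
  ['Ci::TimedOutBuilds::DropRunningService', nil]            => :server_timeout_running,
  ['Ci::TimedOutBuilds::DropCancelingService', nil]          => :server_timeout_canceling
}.freeze

# Legacy reason, kept only so historical records stay interpretable.
LEGACY_REASON = :stuck_or_timeout_failure

# Grace period from the "Configured timeout + 15m" timing column.
SERVER_TIMEOUT_GRACE_SECONDS = 15 * 60

# Returns the new failure reason for a dropping service, raising on an
# unmapped service so that no job silently falls back to the legacy reason.
def failure_reason_for(service_name, variant = nil)
  NEW_FAILURE_REASONS.fetch([service_name, variant]) do
    raise ArgumentError, "no failure reason mapped for #{service_name} (#{variant.inspect})"
  end
end

# True when every service/variant pair maps to a distinct reason, which is
# one way to check the "mutually exclusive" acceptance criterion.
def mutually_exclusive?(mapping)
  mapping.values.uniq.size == mapping.size
end

# Assumption: the server-side deadline is measured from the job's start time.
def server_timeout_exceeded?(started_at, configured_timeout_seconds, now)
  now > started_at + configured_timeout_seconds + SERVER_TIMEOUT_GRACE_SECONDS
end
```

For example, `failure_reason_for('Ci::StuckBuilds::DropPendingService', :stuck_timeout)` resolves to `:stuck_pending_no_matching_runners`. Raising on an unmapped service, rather than defaulting to `stuck_or_timeout_failure`, is a design choice in this sketch that keeps the legacy reason from being assigned to new jobs.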