Clarify and differentiate stuck_or_timeout job failures
Problem
Currently, jobs that fail due to being stuck or timing out are classified with a single failure reason: stuck_or_timeout_failure. This blanket classification obscures important distinctions about the job's state when it was terminated, making it difficult to:
- Understand whether a job was actually stuck (no activity) or simply exceeded a timeout threshold
- Differentiate between jobs that timed out due to resource constraints vs. infrastructure issues
- Provide users with actionable feedback about why their job failed
- Build accurate observability and alerting around job execution failures
- Properly categorize these failures in the broader job failure reason standardization effort (#595703 (closed))
Current State
The stuck_or_timeout_failure reason is applied by StuckCiJobsWorker when it terminates jobs that have been running too long or appear stuck. However, the failure reason does not record:
- What state the job was in when it was selected for termination
- Whether it was actually stuck (no activity) or simply exceeded a time limit
- What specific timeout or stuck condition triggered the failure
Desired Outcomes
- Clarified failure reasons: Any build currently failing as "stuck or timed out" should have a failure reason that clarifies what its state was when it was selected for drop/failure. This includes:
  - Jobs that exceeded a maximum execution time
  - Jobs that showed no activity for an extended period
  - Jobs that were stuck in a specific state (e.g., waiting for runner, waiting for artifacts)
- Enhanced Ci::Build failure reasons: Add new failure reasons to Ci::Build that explicitly indicate when a job was failed by StuckCiJobsWorker, with sufficient detail to understand the termination reason.
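To make the intended distinction concrete, the sketch below classifies a job from its state and how long it has been inactive, using the thresholds listed in the mapping table under Technical Implementation. It is a plain-Ruby illustration under stated assumptions, not the StuckCiJobsWorker implementation: the stuck_reason method and its idle_seconds and matching_runners parameters are hypothetical names, and treating "time in pending" and "Runner silence" as a single idle duration is a simplification.

```ruby
HOUR = 3600 # seconds

# Hypothetical helper: returns the new failure reason for a job that should be
# dropped, or nil if none of the stuck thresholds from the table has been reached.
def stuck_reason(state:, idle_seconds:, matching_runners:)
  case state
  when :pending
    if matching_runners
      # Pending for 24h even though matching runners were available.
      :stuck_pending_with_matching_runners if idle_seconds >= 24 * HOUR
    elsif idle_seconds >= HOUR
      # Pending for 1h with no runner matching the job's requirements.
      :stuck_pending_no_matching_runners
    end
  when :running
    # 1h without updates from the runner.
    :no_updates_running if idle_seconds >= HOUR
  when :canceling
    # 1h without updates while canceling.
    :no_updates_canceling if idle_seconds >= HOUR
  end
end
```

For example, `stuck_reason(state: :pending, idle_seconds: 2 * HOUR, matching_runners: false)` yields `:stuck_pending_no_matching_runners`, while the same job with matching runners available yields `nil` until the 24h threshold is reached.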
Technical Implementation
Mapping to new Failure Reasons
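The existing stuck_or_timeout_failure reason is replaced according to the specific service that drops the job. The table also covers the server-side timeout services, whose previous reasons differ and are listed per row: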
| Service | Previous Reason | New Reason | Timing | Description |
|---|---|---|---|---|
| Ci::StuckBuilds::DropPendingService (outdated timeout) | stuck_or_timeout_failure | stuck_pending_with_matching_runners | 24h in pending | Job was stuck in pending state even though matching runners were available. Indicates potential job configuration issues or transient infrastructure problems. |
| Ci::StuckBuilds::DropPendingService (stuck timeout, from queue) | stuck_or_timeout_failure | stuck_pending_no_matching_runners | 1h in pending | Job was stuck in pending state because no runners matched the job's requirements (tags, protected status, etc.). Indicates runner configuration mismatch. |
| Ci::StuckBuilds::DropRunningService | stuck_or_timeout_failure | no_updates_running | 1h of Runner silence | Job was in running state but showed no activity for an extended period. Indicates the job script may have hung or the runner lost connectivity. |
| Ci::StuckBuilds::DropCancelingService | stuck_or_timeout_failure | no_updates_canceling | 1h of Runner silence | Job was in canceling state but showed no activity for an extended period. The cancellation process stalled. |
| Ci::TimedOutBuilds::DropRunningService | job_execution_timeout | server_timeout_running | Configured timeout + 15m | Running job exceeded the maximum execution time configured for the job. Server-side enforcement of job timeout. |
| Ci::TimedOutBuilds::DropCancelingService | job_execution_server_timeout | server_timeout_canceling | Configured timeout + 15m | Canceling job exceeded the maximum execution time. Server-side enforcement during cancellation. |
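The mapping above can be sketched as a plain lookup keyed by the dropping service. This is a minimal illustration in plain Ruby rather than the actual GitLab code: the NEW_FAILURE_REASONS constant, the :outdated and :stuck variant symbols, and the failure_reason_for helper are hypothetical names. Only the service names and reason names are taken from the table.

```ruby
# Hypothetical lookup: [service name, variant] => new failure reason.
# The variant distinguishes the two DropPendingService code paths in the table.
NEW_FAILURE_REASONS = {
  ['Ci::StuckBuilds::DropPendingService', :outdated] => :stuck_pending_with_matching_runners,
  ['Ci::StuckBuilds::DropPendingService', :stuck]    => :stuck_pending_no_matching_runners,
  ['Ci::StuckBuilds::DropRunningService', nil]       => :no_updates_running,
  ['Ci::StuckBuilds::DropCancelingService', nil]     => :no_updates_canceling,
  ['Ci::TimedOutBuilds::DropRunningService', nil]    => :server_timeout_running,
  ['Ci::TimedOutBuilds::DropCancelingService', nil]  => :server_timeout_canceling
}.freeze

# Returns the new failure reason for a service, raising on unmapped input so
# that a missing entry is caught instead of silently falling back.
def failure_reason_for(service_name, variant = nil)
  NEW_FAILURE_REASONS.fetch([service_name, variant]) do
    raise ArgumentError, "no failure reason mapped for #{service_name} (#{variant.inspect})"
  end
end
```

Because every service (and variant) maps to a distinct reason, the reason alone is enough to tell which drop path terminated a job.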
Backward Compatibility
The existing stuck_or_timeout_failure enum value (ID 4) is preserved in the codebase for historical data compatibility. New jobs will use the more specific failure reasons going forward.
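One way to picture this compatibility rule: the legacy value stays in the enum map so historical rows still resolve to a name, while code that assigns reasons to newly dropped jobs refuses it. The plain-Ruby sketch below rests on assumptions: FAILURE_REASONS, LEGACY_REASONS, ASSIGNABLE_REASONS, and both helpers are hypothetical names, and the numeric IDs of the new reasons are placeholders. Only the ID 4 for stuck_or_timeout_failure is taken from the text above.

```ruby
# Hypothetical enum map. The 9001+ IDs are placeholders, not real values.
FAILURE_REASONS = {
  stuck_or_timeout_failure: 4, # legacy, kept so historical records stay readable
  stuck_pending_with_matching_runners: 9001,
  stuck_pending_no_matching_runners: 9002,
  no_updates_running: 9003,
  no_updates_canceling: 9004,
  server_timeout_running: 9005,
  server_timeout_canceling: 9006
}.freeze

LEGACY_REASONS = %i[stuck_or_timeout_failure].freeze
ASSIGNABLE_REASONS = (FAILURE_REASONS.keys - LEGACY_REASONS).freeze

# Resolves a stored ID back to its reason name (works for legacy rows too).
def reason_name(id)
  FAILURE_REASONS.key(id)
end

# True only for reasons that newly dropped jobs are allowed to receive.
def assignable?(reason)
  ASSIGNABLE_REASONS.include?(reason)
end
```

With this split, `reason_name(4)` still returns `:stuck_or_timeout_failure` for old data, while `assignable?(:stuck_or_timeout_failure)` is false.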
Acceptance Criteria
- Audit current usage of stuck_or_timeout_failure in the codebase
- Define new, mutually exclusive failure reasons that replace stuck_or_timeout_failure
- Update StuckCiJobsWorker to use the new, more specific failure reasons based on job state
- Document the new failure reasons and when each should be applied
- Update any related observability/dashboards to reflect the new failure reasons
- Ensure changes align with the broader job failure reason standardization in #595703 (closed)
Related Work
This proposal is part of the broader effort to standardize and clarify CI job failure reasons: #595703 (closed)