Clarify and differentiate stuck_or_timeout job failures

Problem

Currently, jobs that fail due to being stuck or timing out are classified with a single failure reason: stuck_or_timeout_failure. This blanket classification obscures important distinctions about the job's state when it was terminated, making it difficult to:

  • Understand whether a job was actually stuck (no activity) or simply exceeded a timeout threshold
  • Differentiate between jobs that timed out due to resource constraints vs. infrastructure issues
  • Provide users with actionable feedback about why their job failed
  • Build accurate observability and alerting around job execution failures
  • Properly categorize these failures in the broader job failure reason standardization effort (#595703 (closed))

Current State

The stuck_or_timeout_failure reason is applied by StuckCiJobsWorker when it terminates jobs that have been running too long or appear stuck. However, the failure reason does not record:

  • What state the job was in when it was selected for termination
  • Whether it was actually stuck (no activity) or simply exceeded a time limit
  • What specific timeout or stuck condition triggered the failure

Desired Outcomes

  1. Clarified failure reasons: Any build currently failing as "stuck or timed out" should have a failure reason that states what state the build was in when it was selected to be dropped. This includes:

    • Jobs that exceeded a maximum execution time
    • Jobs that showed no activity for an extended period
    • Jobs that were stuck in a specific state (e.g., waiting for runner, waiting for artifacts)

  2. Enhanced Ci::Build failure reasons: Add new failure reasons to Ci::Build that explicitly indicate when a job was failed by StuckCiJobsWorker, with sufficient detail to understand the termination reason.

Technical Implementation

Mapping to new Failure Reasons

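Each service that drops a job now assigns its own failure reason. Most of these services previously used stuck_or_timeout_failure; the two Ci::TimedOutBuilds services previously used job_execution_timeout and job_execution_server_timeout: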

| Service | Previous Reason | New Reason | Timing | Description |
|---|---|---|---|---|
| Ci::StuckBuilds::DropPendingService (outdated timeout) | stuck_or_timeout_failure | stuck_pending_with_matching_runners | 24h in pending | Job was stuck in pending state even though matching runners were available. Indicates potential job configuration issues or transient infrastructure problems. |
| Ci::StuckBuilds::DropPendingService (stuck timeout, from queue) | stuck_or_timeout_failure | stuck_pending_no_matching_runners | 1h in pending | Job was stuck in pending state because no runners matched the job's requirements (tags, protected status, etc.). Indicates runner configuration mismatch. |
| Ci::StuckBuilds::DropRunningService | stuck_or_timeout_failure | no_updates_running | 1h of Runner silence | Job was in running state but showed no activity for an extended period. Indicates the job script may have hung or the runner lost connectivity. |
| Ci::StuckBuilds::DropCancelingService | stuck_or_timeout_failure | no_updates_canceling | 1h of Runner silence | Job was in canceling state but showed no activity for an extended period. The cancellation process stalled. |
| Ci::TimedOutBuilds::DropRunningService | job_execution_timeout | server_timeout_running | Configured timeout + 15m | Running job exceeded the maximum execution time configured for the job. Server-side enforcement of job timeout. |
| Ci::TimedOutBuilds::DropCancelingService | job_execution_server_timeout | server_timeout_canceling | Configured timeout + 15m | Canceling job exceeded the maximum execution time. Server-side enforcement during cancellation. |
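
The mapping above can be read as a lookup keyed by the dropping service, plus a variant for the two DropPendingService cases. The following is a minimal standalone sketch, not the actual implementation: the constant FAILURE_REASON_BY_SERVICE, the method failure_reason_for, and the variant symbols :outdated_timeout and :stuck_timeout are hypothetical names chosen for illustration. Only the service names and reason names are taken from the table.

```ruby
# Hypothetical lookup table. Service and reason names come from the mapping
# table; the constant name and the variant symbols are illustrative only.
FAILURE_REASON_BY_SERVICE = {
  ["Ci::StuckBuilds::DropPendingService", :outdated_timeout] => :stuck_pending_with_matching_runners,
  ["Ci::StuckBuilds::DropPendingService", :stuck_timeout] => :stuck_pending_no_matching_runners,
  ["Ci::StuckBuilds::DropRunningService", nil] => :no_updates_running,
  ["Ci::StuckBuilds::DropCancelingService", nil] => :no_updates_canceling,
  ["Ci::TimedOutBuilds::DropRunningService", nil] => :server_timeout_running,
  ["Ci::TimedOutBuilds::DropCancelingService", nil] => :server_timeout_canceling
}.freeze

# Returns the new failure reason for a service (and optional variant).
# Raises instead of falling back to a catch-all, so that every drop path
# has to name an explicit reason.
def failure_reason_for(service_name, variant = nil)
  FAILURE_REASON_BY_SERVICE.fetch([service_name, variant]) do
    raise ArgumentError, "no failure reason mapped for #{service_name} (#{variant.inspect})"
  end
end
```

For example, failure_reason_for("Ci::StuckBuilds::DropRunningService") returns :no_updates_running, while an unmapped service raises ArgumentError rather than silently reusing a generic reason.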

Backward Compatibility

The existing stuck_or_timeout_failure enum value (ID 4) is preserved in the codebase for historical data compatibility. New jobs will use the more specific failure reasons going forward.
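
As a rough illustration of this compatibility rule, the sketch below keeps the legacy value readable while excluding it from new assignments. It is a plain-Ruby stand-in, not the real enum definition: only ID 4 for stuck_or_timeout_failure comes from this issue, the other IDs are placeholders, and FAILURE_REASONS, LEGACY_REASONS, reason_name, and assignable? are hypothetical names.

```ruby
# Hypothetical stand-in for the failure reason enum.
FAILURE_REASONS = {
  stuck_or_timeout_failure: 4,                 # ID from this issue; kept for historical rows
  stuck_pending_with_matching_runners: 9001,   # placeholder ID, not the real value
  stuck_pending_no_matching_runners: 9002,     # placeholder ID, not the real value
  no_updates_running: 9003,                    # placeholder ID, not the real value
  no_updates_canceling: 9004,                  # placeholder ID, not the real value
  server_timeout_running: 9005,                # placeholder ID, not the real value
  server_timeout_canceling: 9006               # placeholder ID, not the real value
}.freeze

# Reasons that stay in the enum only so existing records still resolve.
LEGACY_REASONS = %i[stuck_or_timeout_failure].freeze

# Resolves a stored ID to its reason name; historical rows with ID 4 still work.
def reason_name(id)
  FAILURE_REASONS.key(id)
end

# True only for reasons that new jobs are allowed to be failed with.
def assignable?(reason)
  FAILURE_REASONS.key?(reason) && !LEGACY_REASONS.include?(reason)
end
```

Under these assumptions, reason_name(4) still resolves to :stuck_or_timeout_failure for old records, while assignable?(:stuck_or_timeout_failure) is false for new ones.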

Acceptance Criteria

  • Audit current usage of stuck_or_timeout_failure in the codebase
  • Define new, mutually exclusive failure reasons that replace stuck_or_timeout_failure
  • Update StuckCiJobsWorker to use the new, more specific failure reasons based on job state
  • Document the new failure reasons and when each should be applied
  • Update any related observability/dashboards to reflect the new failure reasons
  • Ensure changes align with the broader job failure reason standardization in #595703 (closed)

This proposal is part of the broader effort to standardize and clarify CI job failure reasons: #595703 (closed)

Edited by drew stachon