Standardize and clarify CI job failure reasons
<details>
<summary>
Everyone can contribute. [Help move this issue forward](https://handbook.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows/#contributor-links) while earning points, leveling up and collecting rewards.
</summary>

- [Label this issue](https://contributors.gitlab.com/manage-issue?action=label&projectId=278964&issueIid=595703)

</details>

## Problem

Job failure reasons in the codebase are not crisp or mutually exclusive. They overlap and lack clear categorization, making it difficult to:

- Distinguish between user-error/misconfiguration vs infrastructure issues
- Build reliable alerting and incident detection
- Provide actionable feedback to users

## Current State

Job failure reasons are scattered and inconsistent across the codebase. We need to audit and standardize them.

![image](/uploads/33696706851b7c7660eda5c3689f0157/image.png){width=900 height=511}

### Failure Reason Classification

Verify engineers: please classify each failure reason as **Customer based**, **GitLab infrastructure based**, or **Unclear or a combination** by filling in the "Classification" column.

| Failure Reason | Classification | Notes |
|---|---|---|
| `api_failure` | Customer | ~~Major contributor~~ excluded |
| `data_integrity_failure` | **System** | Major contributor |
| `deployment_rejected` | Customer | |
| `downstream_bridge_project_not_found` | Customer | |
| `downstream_pipeline_creation_failed` | Customer | System-related failure would show up as a failure in the target project |
| `duo_workflow_not_allowed` | Customer | Successful guardrail: feature/entitlement gate, not a malfunction |
| `failed_outdated_deployment_job` | Customer | |
| `insufficient_bridge_permissions` | Customer | |
| `invalid_bridge_trigger` | Customer | Invalid downstream/upstream trigger configuration in pipeline YAML |
| `ip_restriction_failure` | Customer | Admin-configured network restriction |
| `job_execution_timeout` | Customer | User-configured timeout (default 1hr); Runner IS healthy and heartbeating. If Runner stops heartbeating, the job gets `stuck_or_timeout_failure` instead. |
| `job_token_expired` | Customer | CI job token TTL exceeded; overwhelmingly caused by long-running jobs. Edge case: processing delays could contribute, but rare. |
| `missing_dependency_failure` | Customer | Job depends on artifacts from another job that failed or expired (pipeline configuration issue) |
| `no_matching_runner` | Customer | Despite the name, this is the EE **plan-gating** check (`runner.allowed_plans` vs project subscription plan), set immediately at pipeline creation. NOT "no runner with matching tags"; that scenario results in `stuck_or_timeout_failure` after the stuck-builds cron. |
| `protected_environment_failure` | Customer | Job targets a protected environment the user doesn't have access to deploy to |
| `reached_max_descendant_pipelines_depth` | Customer | Successful guardrail: prevents runaway pipeline chains |
| `runner_system_failure` | **System** | Major contributor; returned any time there is a panic/crash on the Runner, or if it doesn't manage to send traces. |
| `runner_unsupported` | **Combination** | Runner doesn't advertise a feature the job requires (e.g. `multi_build_steps`, `job_inputs`, `fallback_cache_keys`). Checked via `Build#supported_runner?` against `RUNNER_FEATURES`. For SaaS shared runners this is System (GitLab controls the fleet); for self-managed/project runners it's Customer (operator controls upgrades/capabilities). |
| `scheduler_failure` | **System** | Sidekiq scheduling failure (internal infrastructure issue) |
| `secrets_provider_not_found` | Customer | External secrets provider (e.g. Vault) not configured or unreachable (user/admin configuration) |
| `stale_schedule` | **System** | Delayed job could not be executed; a healthy system should run scheduled jobs on time |
| `stuck_or_timeout_failure` | **Combination** | Major contributor, being split into specific sub-reasons in https://gitlab.com/gitlab-org/gitlab/-/work_items/595752. Will include System reasons (e.g. runner went silent mid-execution) and Customer reasons (e.g. no runner with matching tags). |
| `trace_size_exceeded` | Customer | Job log exceeds max trace size limit because the script produces too much output |
| `unknown_failure` | **System** (conservative) | Major contributor; this is a catch-all for [failure reasons not known to the GitLab server](https://gitlab.com/gitlab-org/gitlab-runner/blob/b4cc0820d97548dff98336d498b92ad4026c4404/common/failure_reason_mapper.go#L58-60) (`features.failure_reasons`). See also [Build.setTraceStatus](https://gitlab.com/gitlab-org/gitlab-runner/blob/b4cc0820d97548dff98336d498b92ad4026c4404/common/build.go#L950-956). Cannot attribute to customer, so conservatively counted as System. |
| `unmet_prerequisites` | Customer | Kubernetes/environment setup prerequisites not met before job can run (deployment configuration issue) |
| `upstream_bridge_project_not_found` | Customer | Upstream bridge references a project that doesn't exist or user lacks access |

## Handling for each type of failure

- **Customer-based job failure:** Should not appear on this panel: https://dashboards.gitlab.net/d/pp6cq8v/temp3a-pipeline-observability?viewPanel=13&orgId=1&from=now-24h&to=now&timezone=utc&var-environment=gprd&var-stage=main
- **GitLab-based job failure:** Should appear in this panel
- **Combination or unknown states:** Should have their usage split so we know (at least) whether or not GitLab's infrastructure is responsible for the failure.

## Desired State

1. Failure reasons should be clearly categorized into:
   - **User Error/Misconfiguration** - User's job config, script, or setup is incorrect
   - **Infrastructure/Outage** - GitLab.com or runner infrastructure issue
   - **Resource Constraints** - Timeout, memory, disk space, etc.
   - **System Error** - Unexpected internal failure
2. The job-failure related panels in the "Job Execution" section of [this dashboard](https://dashboards.gitlab.net/d/pp6cq8v/temp3a-pipeline-observability?orgId=1&from=now-3h&to=now&timezone=utc&var-environment=gprd&var-stage=main) should only reflect errors that GitLab caused, and could have prevented through more resilient or reliable operations.
3. Each reason should be:
   - Mutually exclusive
   - Clearly documented
   - Consistently applied across the codebase

## Acceptance Criteria

- [ ] Audit all job failure reasons currently codified in gitlab-org/gitlab
- [ ] Document overlapping or ambiguous reasons
- [ ] Define clear, non-overlapping categories
- [ ] Create mapping of existing reasons to new categories
- [ ] Update codebase to use standardized reasons
- [ ] Add documentation for engineers on when to use each reason
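
## Illustrative sketch

The classification table and the handling rules above could be expressed as a single lookup, which also makes the "mutually exclusive" requirement mechanically checkable. This is a minimal Python sketch for discussion only, not the GitLab implementation (which lives in Ruby). The names `Classification`, `build_classification`, `classify`, and `panel_treatment` are hypothetical, and treating reasons missing from the table as System is an assumption that mirrors the conservative handling of `unknown_failure`.

```python
from enum import Enum


class Classification(Enum):
    CUSTOMER = "customer"
    SYSTEM = "system"
    COMBINATION = "combination"


# Reasons grouped exactly as in the classification table of this issue.
REASONS_BY_CLASSIFICATION = {
    Classification.CUSTOMER: [
        "api_failure",
        "deployment_rejected",
        "downstream_bridge_project_not_found",
        "downstream_pipeline_creation_failed",
        "duo_workflow_not_allowed",
        "failed_outdated_deployment_job",
        "insufficient_bridge_permissions",
        "invalid_bridge_trigger",
        "ip_restriction_failure",
        "job_execution_timeout",
        "job_token_expired",
        "missing_dependency_failure",
        "no_matching_runner",
        "protected_environment_failure",
        "reached_max_descendant_pipelines_depth",
        "secrets_provider_not_found",
        "trace_size_exceeded",
        "unmet_prerequisites",
        "upstream_bridge_project_not_found",
    ],
    Classification.SYSTEM: [
        "data_integrity_failure",
        "runner_system_failure",
        "scheduler_failure",
        "stale_schedule",
        "unknown_failure",
    ],
    Classification.COMBINATION: [
        "runner_unsupported",
        "stuck_or_timeout_failure",
    ],
}


def build_classification(groups):
    """Flatten the groups into reason -> Classification.

    Raises ValueError if a reason appears in more than one group, which
    enforces that the categories are mutually exclusive.
    """
    mapping = {}
    for classification, reasons in groups.items():
        for reason in reasons:
            if reason in mapping:
                raise ValueError(f"{reason} is classified more than once")
            mapping[reason] = classification
    return mapping


CLASSIFICATION = build_classification(REASONS_BY_CLASSIFICATION)


def classify(reason):
    # Assumption: a reason missing from the table is counted as SYSTEM,
    # mirroring the conservative treatment of `unknown_failure`.
    return CLASSIFICATION.get(reason, Classification.SYSTEM)


# Hypothetical encoding of "Handling for each type of failure".
PANEL_TREATMENT = {
    Classification.CUSTOMER: "hide",      # should not appear on the panel
    Classification.SYSTEM: "show",        # should appear on the panel
    Classification.COMBINATION: "split",  # must be split before attribution
}


def panel_treatment(reason):
    return PANEL_TREATMENT[classify(reason)]
```

For example, `panel_treatment("stuck_or_timeout_failure")` returns `"split"`, flagging the reason as one that still needs sub-reasons before it can be attributed. Building the lookup from groups rather than writing it by hand means a reason accidentally listed under two categories fails at load time.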