Add ConfigurationError and ExitCodeInvalidConfiguration to GitLab Runner Build Error Classifications

Problem to solve

CI jobs should not report runner_system_failure when the underlying problem is a CI job configuration issue.

Background

On GitLab.com, and potentially soon on GitLab Dedicated Hosted Runners, SLIs and some SLAs ultimately rely on runner_system_failure to record a 5xx-style "server error": a failure caused by the GitLab Runner system and therefore attributable to GitLab. This is in contrast to a 4xx-style "client error", which is caused by a problem on the client side.

However, there are many cases where a client misconfiguration is recorded as a runner_system_failure.

This leads to unnecessary error budget leaks, false-positive alerts for EOCs and, in the worst case, contractual SLA noncompliance (with potential financial implications for GitLab).

Examples of client configuration failures incorrectly attributed to `runner_system_failure`

Incorrect Vault secret references in .gitlab-ci.yml job definitions. If a client job references an incorrect Vault secret, the job fails with runner_system_failure. For instance, see https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/sandbox/byok-test-keys/-/jobs/8903327292 (internal visibility only).
Anecdotally, I've been told that certain DDOS-style failures, such as a bash fork bomb, lead to runner_system_failure, but 1) this is not confirmed and 2) it is admittedly harder to solve.

Proposal(s)

Option 1: Ensure that the error attribute runner_system_failure is only attributed to "own fault" server errors, not to client configuration errors.
Option 2: Introduce a new error type which can be used to distinguish client errors from server errors.
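
To make Option 2 more concrete, here is a minimal sketch of how a separate failure reason could keep client configuration errors out of the SLI. It is written in Python purely for illustration and is not the runner's actual code. `runner_system_failure` is the existing reason discussed above; the `configuration_error` value, the `JobConfigurationError` exception, and the `classify` and `counts_against_slo` helpers are all hypothetical names assumed for this example.

```python
from enum import Enum


class FailureReason(Enum):
    # Existing failure reason discussed in this issue.
    RUNNER_SYSTEM_FAILURE = "runner_system_failure"
    # Hypothetical new reason for client-side configuration problems (Option 2).
    CONFIGURATION_ERROR = "configuration_error"


class JobConfigurationError(Exception):
    """Hypothetical error raised when the job definition itself is invalid,
    for example when .gitlab-ci.yml references a Vault secret that does not exist."""


def classify(error: Exception) -> FailureReason:
    """Map an error to a failure reason (illustrative logic only)."""
    if isinstance(error, JobConfigurationError):
        # 4xx-style "client error": attributable to the job configuration.
        return FailureReason.CONFIGURATION_ERROR
    # Anything else is treated as a 5xx-style "server error" in this sketch.
    return FailureReason.RUNNER_SYSTEM_FAILURE


def counts_against_slo(reason: FailureReason) -> bool:
    """Only runner system failures would consume error budget."""
    return reason is FailureReason.RUNNER_SYSTEM_FAILURE
```

With a split like this, a job that fails because of a bad secret reference would be classified as `configuration_error` and would not count against the error budget, while genuine runner faults would still be reported as `runner_system_failure`.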

cc @o-lluch @amknight @gabrielengel_gl @tmaczukin @josephburnett

Edited May 02, 2025 by Darren Eastman - OOO Back on 2026-01-05
