Add ConfigurationError and ExitCodeInvalidConfiguration to GitLab Runner Build Error Classifications
Problem to solve
CI jobs should not report runner_system_failure when the underlying problem is a CI job configuration issue.
Background
On GitLab.com, and potentially soon on GitLab Dedicated Hosted Runners, SLIs and some SLAs ultimately rely on runner_system_failure to record a 5xx-style "server error": a failure caused by the GitLab Runners system and therefore attributable to GitLab. This is in contrast to a 4xx-style "client error", which is caused by a problem on the client side.
However, there are many cases where client misconfiguration is recorded as a runner_system_failure.
This leads to unnecessary error budget leakage, false-positive alerts for EOCs and, in the worst case, contractual SLA noncompliance (with potential financial implications for GitLab).
Examples of client configuration failures incorrectly attributed to runner_system_failure
- Incorrect vault secret references in .gitlab-ci.yml job definitions. If a client job references an incorrect vault secret, the job fails with runner_system_failure. For instance, https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/sandbox/byok-test-keys/-/jobs/8903327292 (internal visibility only)
- Anecdotally, I've been told that certain DDoS-style failures, such as a bash fork bomb, lead to runner_system_failure, but 1) this is not confirmed and 2) it is admittedly harder to solve.
Proposal(s)
- Option 1: Ensure that the error attribute runner_system_failure is only attributed to "own fault" server errors, not client configuration errors.
- Option 2: Introduce a new error type which can be used to distinguish client errors from server errors.
cc @o-lluch @amknight @gabrielengel_gl @tmaczukin @josephburnett
Edited  by Darren Eastman