Stop considering Docker image pull as runner system failure

Overview

In gitlab-com/gl-infra/production#4649 (closed) we saw a spike of system failures because the runner failed to pull Docker images, for example in https://gitlab.com/steveazz/playground/-/jobs/1278831109. For GitLab.com we have an SLI that checks the error rate of runner_system_failure. The image keyword is controlled by the user, so a single user can trigger this SLI with an image that doesn't exist, as we saw in gitlab-com/gl-infra/production#4649 (closed), and there is no action for us to take.

Proposal

When the runner fails to pull an image, the job failure shouldn't be counted as a runner system failure.
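
To illustrate the effect of the proposal, here is a minimal sketch in Python. It is not GitLab Runner code: `ImagePullError`, `classify`, `system_failure_rate`, and the `image_pull_failure` reason are hypothetical names chosen for this example, and the job counts are made up. Only `runner_system_failure` comes from the issue itself. The sketch assumes the SLI is a simple ratio of system failures to total jobs.

```python
from collections import Counter
from enum import Enum


class FailureReason(Enum):
    RUNNER_SYSTEM_FAILURE = "runner_system_failure"
    # Hypothetical separate reason; the issue does not name one.
    IMAGE_PULL_FAILURE = "image_pull_failure"


class ImagePullError(Exception):
    """Hypothetical error raised when a job's image cannot be pulled."""


def classify(error: Exception) -> FailureReason:
    """Map an error to a failure reason, keeping image pull errors separate."""
    if isinstance(error, ImagePullError):
        return FailureReason.IMAGE_PULL_FAILURE
    return FailureReason.RUNNER_SYSTEM_FAILURE


def system_failure_rate(reasons, total_jobs: int) -> float:
    """Share of jobs that count against the runner_system_failure SLI."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    counts = Counter(reasons)
    return counts[FailureReason.RUNNER_SYSTEM_FAILURE] / total_jobs


if __name__ == "__main__":
    # Made-up numbers: one user triggers 50 pulls of a nonexistent image,
    # plus one genuine infrastructure error, out of 1000 jobs.
    errors = [ImagePullError("image not found")] * 50
    errors.append(RuntimeError("docker daemon unreachable"))
    reasons = [classify(e) for e in errors]
    print(system_failure_rate(reasons, total_jobs=1000))
```

With this split, the 50 user-caused pull failures no longer count toward the error rate, and only the genuine infrastructure error does.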

Edited by Steve Xuereb