Provide type of runner job failure

Problem

In the incident gitlab-com/gl-infra/production#17636 (closed), CI jobs that landed on a new runner manager with a configuration issue failed. After fixing the config, we wanted to monitor the runner and make sure jobs could succeed. However, as pointed out in gitlab-com/gl-infra/production#17636 (comment 1783688643), we could not tell the cause of a failure without looking into the details.

We have a graph of job failures by reason for our runner managers. In gitlab-com/gl-infra/production#17636 (closed), the runners were timing out while trying to clone the repo, and these failures showed up as script_failure. script_failure errors are typically not related to GitLab.com infrastructure, so it would not be wise to alert on them. The failures in gitlab-com/gl-infra/production#17636 (closed) were visible in the graphs, but were judged unimportant because we have little control over script_failure errors.

I think it would help to have a distinct step for cloning, reported separately from user-caused errors such as script_failure, so that we could alert on it.

Proposal

Separate script_failure errors caused by the "project couldn't be cloned" condition from failures that happen during script execution.
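
To illustrate the idea, here is a minimal sketch of the classification, not the runner's actual implementation. The stage names, the `classify_failure` and `should_alert` helpers, and the `source_fetch_failure` reason are all hypothetical; only `script_failure` comes from the discussion above, and `runner_system_failure` is used here as an assumed fallback label.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical job stages; names are illustrative only.
    PREPARE = "prepare"
    GET_SOURCES = "get_sources"
    USER_SCRIPT = "user_script"


def classify_failure(stage: Stage) -> str:
    """Map the stage in which a job failed to a failure reason label."""
    if stage is Stage.GET_SOURCES:
        # Hypothetical new reason: the project could not be cloned.
        return "source_fetch_failure"
    if stage is Stage.USER_SCRIPT:
        # Failure inside the user's own script.
        return "script_failure"
    # Assumed fallback for failures outside cloning and user scripts.
    return "runner_system_failure"


def should_alert(reason: str) -> bool:
    """Alert on everything except user-caused script failures."""
    return reason != "script_failure"
```

With a split like this, a clone timeout would be reported under its own reason and could be alerted on, while failures in user scripts would stay out of infrastructure alerts.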

Edited Mar 01, 2024 by Darren Eastman