Sanity checks or post-deployment monitoring of runners
<!-- This template is for GitLab Team Members seeking support of SRE where there isn't an existing `request-*` template available. Please fill out the details below. -->

**Details**

- Point of contact for this request: [+ @user +]
- If a call is needed, what is the proposed date and time of the call: [+ Date and Time +]
- Additional call details (format, type of call): [+ additional details +]

**SRE Support Needed**

[+ Support Request Details +]

<!-- Please do not edit the below -->

Incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17422 was caused by a runner deployment, https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4355, that caused Docker-in-Docker to stop working. We weren't aware of this for an hour, until it was mentioned in Slack. The change had a growing scope of customer impact (as more runners were deployed), resulting in the service being unusable in its final stages. Ideally, we'd like to be alerted to this in some capacity before it reaches this stage.
- https://gitlab.com/gitlab-org/gitlab/-/issues/438688
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/37325

This metric covers script failures (`failure_reason="script_failure"`) across the SaaS runners. It's not suitable for an automated alert by itself, but is probably a good place to start: [thanos](https://thanos.gitlab.net/graph?g0.expr=sum%20by%20(instance%2Cfailure_reason)%0A(%0A%20%20increase(gitlab_runner_failed_jobs_total%7Benvironment%3D~%22gprd%22%2Cstage%3D~%22main%22%2Cinstance%3D~%22runners-manager-saas-linux-small-amd64-blue-1.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-blue-2.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-blue-3.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-blue-4.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-blue-5.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-green-1.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-green-2.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-green-3.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-green-4.c.gitlab-ci-155816.internal.*%7Crunners-manager-saas-linux-small-amd64-green-5.c.gitlab-ci-155816.internal.*%22%2Cfailure_reason%3D~%22script_failure%22%7D%5B5m%5D)%0A)&g0.tab=0&g0.stacked=0&g0.range_input=12h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
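
For readability, here is a rough Python sketch of how the query behind the Thanos link could be rebuilt for a given set of runner managers, plus one possible way to turn the raw count into a post-deployment signal. The query text mirrors the expression encoded in the link. The helper names, the baseline comparison, and the `factor` and `min_failures` thresholds are hypothetical illustrations, not an agreed alerting rule.

```python
from typing import Sequence


def build_script_failure_query(instances: Sequence[str], window: str = "5m") -> str:
    """Rebuild the PromQL expression from the Thanos link for the given instances."""
    # Each instance is matched as a prefix, as in the linked query.
    instance_re = "|".join(f"{name}.*" for name in instances)
    return (
        "sum by (instance,failure_reason)\n"
        "(\n"
        "  increase(gitlab_runner_failed_jobs_total{"
        'environment=~"gprd",stage=~"main",'
        f'instance=~"{instance_re}",'
        'failure_reason=~"script_failure"}'
        f"[{window}])\n"
        ")"
    )


def is_failure_spike(
    baseline: float,
    current: float,
    factor: float = 3.0,        # hypothetical threshold, needs tuning
    min_failures: float = 20.0,  # hypothetical floor to ignore low-traffic noise
) -> bool:
    """Flag a spike when post-deploy failures clearly exceed the pre-deploy baseline."""
    if current < min_failures:
        return False
    return current > factor * max(baseline, 1.0)


# Runner manager names as they appear in the linked query.
MANAGERS = [
    f"runners-manager-saas-linux-small-amd64-{color}-{n}.c.gitlab-ci-155816.internal"
    for color in ("blue", "green")
    for n in range(1, 6)
]

if __name__ == "__main__":
    print(build_script_failure_query(MANAGERS))
```

Comparing against a pre-deployment baseline, rather than alerting on the absolute count, is one way to account for the fact that script failures also happen during normal operation, which is why the metric is not suitable for an alert on its own.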