Sanity checks or post-deployment monitoring of runners

Details

  • Point of contact for this request: @user
  • If a call is needed, what is the proposed date and time of the call: Date and Time
  • Additional call details (format, type of call): additional details

SRE Support Needed

Support Request Details

Incident production#17422 was caused by a runner deployment (https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4355) that caused Docker-in-Docker to stop working. We weren't aware of this for an hour, until it was mentioned in Slack.

The change had a growing scope of customer impact (as more runners were deployed), resulting in the service being unusable in its final stages. Ideally we'd like to be alerted of this in some capacity before it reaches this stage.

This metric is for script_failures across SaaS runners. It's not suitable for an automated alert by itself, but is probably a good place to start: thanos
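As a rough illustration of how the script_failures metric could feed a post-deployment sanity check, here is a minimal sketch that compares the failure rate observed after a deployment with a pre-deployment baseline. Everything in it is an assumption for discussion: the function name, the thresholds, and the idea that failure rates have already been fetched as lists of floats are all hypothetical, and it does not query Thanos itself.

```python
from statistics import mean, stdev


def failure_rate_is_anomalous(baseline, recent, min_sigma=3.0, min_ratio=2.0):
    """Return True when the recent failure rate is well above the baseline.

    baseline: failure-rate samples (0.0-1.0) from before the deployment.
    recent:   failure-rate samples from after the deployment started.
    Both thresholds are hypothetical and would need tuning on real data.
    """
    # Not enough data to estimate spread, or nothing to compare: do not flag.
    if len(baseline) < 2 or not recent:
        return False

    base_mean = mean(baseline)
    base_std = stdev(baseline)
    recent_mean = mean(recent)

    # Require both conditions so a noisy but low baseline does not trigger
    # on small absolute changes.
    above_sigma = recent_mean > base_mean + min_sigma * base_std
    above_ratio = recent_mean > min_ratio * base_mean
    return above_sigma and above_ratio


if __name__ == "__main__":
    before = [0.05, 0.06, 0.04, 0.05, 0.05]
    after = [0.20, 0.25, 0.30]
    print(failure_rate_is_anomalous(before, after))
```

Requiring both a sigma-based and a ratio-based threshold is one way to reduce noise, since the raw metric is not suitable for alerting on its own. A check like this could run as a gate between deployment stages rather than as a standing alert.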

Edited by Calliope Gardner