
Improve on-call training for self-managed incidents

Context

As a follow-up to one of the action items mentioned here, we should improve the on-call training material for Gitaly engineers to cover scenarios involving self-managed customers.

Unlike on .com, we have far less visibility into how Gitaly is operating on a self-managed instance. In the linked incident above, we relied on periodic gitlabsos dumps, summaries from fast-stats, and ad-hoc output from ps and other commands to troubleshoot the problem. The investigation was less structured than it would have been for a .com incident.

Grafana is also often not configured on self-managed instances, so visualising performance over time can be tricky.

Proposal

Work with Support to devise additional on-call training that covers what we should expect from a self-managed customer emergency. For reference, https://gitlab.com/gitlab-org/gitaly/-/issues/6370 is the training issue we've used to prepare for .com incidents.

cc: @anton
