Discussion: Oncall onboarding questions

Details

SRE Support Needed

The current onboarding template for going on-call is https://gitlab.com/gitlab-com/gl-infra/reliability/-/blob/master/.gitlab/issue_templates/onboarding-oncall.md.

I'm about to start on-call and am finishing up my copy of that issue, and I wanted to get a few questions answered. Once I get some solid answers here, I will update the documentation accordingly. So, let's get this party started!

  1. In multiple places in our documentation, we reference #feed_alerts-general, including most notably on the README for runbooks. However, in multiple shadow shifts, I've never seen anyone reference that channel, and looking at the channel itself, the only person who appears to be interacting with the alerts is @andrewn. Is this a channel we should do something with, or do we need to update our documentation?
  2. One of the tasks on the onboarding issue is 'First drain and then ready connections from one of the zonal clusters in staging.', with a link to https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/frontend/haproxy.md#set-server-state. This documentation doesn't actually cover doing so for a full zonal cluster, only for individual hosts. I read the scripts in question, and I think it would work fine, but before I either do it or update the documentation: is this actually a useful thing for on-call? If it is, would someone like to pair with me on doing it next week, after which I'll update the docs?
  3. I suspect this is a question of philosophy, but it's not written down anywhere I can see. If you're in an incident and you need to make a production change, do you open a change request, or just handle it as part of the incident?

@knottos and/or @nduff, you two are also newbies. Do you have any questions you'd like to add to this, so we can put all the doc updates in one place?