Redefine outage escalation policies & ownership
In the past, the on-call/escalation system for the DB team was rather vague. For example, Greg and I were on a weekly on-call rotation but @abrandl was not. Worse, when outages happened we didn't always involve DB engineers unless they were already around. There are four questions we need to answer:
- What is owned by Production, and what is owned by Database?
- How should Production escalate problems to Database?
- How do we deal with the lack of around-the-clock coverage (@abrandl and I are both in the same timezone)?
- How can we ensure that the problems of the last outage (= unorganised chaos) don't happen again?
My rough idea would be the following:
- Ownership: Database owns the PostgreSQL and pgbouncer processes (so the software side of things), but not the underlying hardware. A disk failure is something for Production, but pgbouncer not starting up is something for Database.
- Escalation: Production triages the problem to determine whether they can handle it or should escalate it. In light of the last outage's chaos I prefer to (at least initially) err on the side of paging Database too often.
- Coverage: in the longer term we're hiring engineers in the US, but in the short term we might need a Production engineer to fill out the on-call rotation. Having just me and @abrandl on-call isn't very helpful because of the timezone issue, and it can create a lot of stress if we end up getting pinged during the night.
- Workflow: for the last problem I think we need some kind of "outage workflow" document. We have the runbooks, but they're not super well structured. I'm thinking of some kind of checklist with steps such as "Step 1: Is pgbouncer running?", "Step 2: Is the value of counter X greater than Y?", etc.
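
To make the ownership and escalation rules a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the component names, the `OWNERS` table, and the `teams_to_page` helper are hypothetical and not part of any existing tooling. It only encodes the rough idea above: Production always triages, and anything owned by Database (or with no clear owner) also pages Database.

```python
from __future__ import annotations

from enum import Enum


class Team(Enum):
    PRODUCTION = "Production"
    DATABASE = "Database"


# Hypothetical ownership table; the real list still needs to be agreed on.
OWNERS = {
    "postgresql": Team.DATABASE,  # software side: owned by Database
    "pgbouncer": Team.DATABASE,
    "disk": Team.PRODUCTION,      # underlying hardware: owned by Production
}


def teams_to_page(component: str) -> set[Team]:
    """Return the teams to page for a failing component.

    Production always triages first. Components owned by Database, or
    without a known owner, also page Database, which errs on the side
    of paging Database too often.
    """
    owner = OWNERS.get(component.lower())
    if owner is Team.PRODUCTION:
        return {Team.PRODUCTION}
    return {Team.PRODUCTION, Team.DATABASE}
```

With this table, a disk failure stays with Production, while pgbouncer not starting up (or any unknown component) pages both teams.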
In the short term I think the workflow document would be the most helpful, as it allows Production to deal with outages even when Database is not available. Unfortunately, this requires us to know what that flow would be and what to look for, and right now I'm not exactly sure.
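
As a rough illustration of what such a checklist could look like if it were written down in a structured way, here is a small Python sketch. The `Step` and `run_checklist` names are hypothetical, and the checks are stubs: the actual steps, counters, and thresholds are exactly the part we don't know yet.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    question: str
    check: Callable[[], bool]  # returns True when this step looks healthy
    on_failure: str            # what to do when the check fails


def run_checklist(steps: list[Step]) -> Optional[Step]:
    """Run the steps in order and return the first failing step, or None."""
    for number, step in enumerate(steps, start=1):
        passed = step.check()
        print(f"Step {number}: {step.question} -> {'ok' if passed else 'FAILED'}")
        if not passed:
            print(f"  Action: {step.on_failure}")
            return step
    return None


# Placeholder steps; the real checks would have to query the actual systems.
example_steps = [
    Step("Is pgbouncer running?", lambda: True, "Escalate to Database"),
    Step("Is counter X within its limit Y?", lambda: False, "Escalate to Database"),
]
first_failure = run_checklist(example_steps)
```

Stopping at the first failing step keeps the flow linear, which matches the "Step 1, Step 2, ..." shape described above; a real workflow might need branching instead.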
Edited by Yorick Peterse