Procedural changes to help Scalability team members prevent and mitigate incidents
This came up on Slack in the aftermath of production#17504 (closed). As corrective actions for that incident, we created a number of issues to catch this and similar problems before they reach production. In the course of that work, we'll also be working with stage groups to improve other areas that carry similar risks.
The following questions also came up during the incident call:
- During the call, nobody from the Scalability group was around; the incident fell between timezones. We might have helped diagnose the problem more quickly because we have some knowledge of our Sidekiq architecture and where to look during incidents. Being present would also help us spot observability gaps. Should we be more actively involved in incidents, beyond regular on-call duties?
- How can we prevent things like this from reaching production in the first place?
- Do people know how and when to find us for Scalability-type reviews?
- Can we educate other maintainers to spot problems like this?