Staging alerting and SLA discussion
Over the weekend I got paged for staging being down. I spent about an hour and a half troubleshooting the issue with the help of @cmiskell. The cause turned out to be a deploy that exhausted the database connections. Given that, I think we should discuss team ownership and the SLA for staging.
Some questions to answer:
- Which team should be the DRI for staging? (I think this is Delivery right now.)
- What should the SLA for staging be? For example, should an SRE spend their weekend trying to fix staging after a bad deploy?
  - QA has daily tests running on staging.
  - Developers have full access to staging.
  - Deployments will not go to production if staging is broken.
- Should we have an SLA at all? If we do, should we fix all of our alerts for staging? Right now, the only alert for staging fires when staging has not returned a 200 for 30 minutes, and it goes to PagerDuty as a high-priority alert.
  - We originally silenced the staging alerts because they were very noisy and not actionable, which led to alert fatigue.
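
To make the current alert condition concrete for the discussion, here is a minimal Python sketch of "staging has not returned a 200 for 30 minutes". This is not our actual alerting rule. The function name `should_page`, the sample format, and the idea of evaluating a list of probe results are all assumptions for illustration; only the 200 status and the 30-minute window come from the description above.

```python
from datetime import datetime, timedelta

# 30-minute threshold taken from the current staging alert described above.
WINDOW = timedelta(minutes=30)


def should_page(samples, now, window=WINDOW):
    """Return True if no probe has returned HTTP 200 for at least `window`.

    `samples` is a hypothetical list of (timestamp, http_status) tuples,
    sorted oldest first. With no samples at all we do not page.
    """
    if not samples:
        return False

    # Find the most recent healthy probe at or before `now`.
    last_ok = None
    for ts, status in samples:
        if status == 200 and ts <= now:
            last_ok = ts

    # If staging was never healthy, measure from the first probe we have.
    unhealthy_since = last_ok if last_ok is not None else samples[0][0]
    return now - unhealthy_since >= window


t0 = datetime(2020, 1, 1, 0, 0)
probes = [
    (t0, 200),
    (t0 + timedelta(minutes=5), 500),
    (t0 + timedelta(minutes=20), 503),
]
page_at_29 = should_page(probes, t0 + timedelta(minutes=29))
page_at_31 = should_page(probes, t0 + timedelta(minutes=31))
```

Whatever SLA we settle on would mostly change two things in a rule like this: the length of the window, and whether a `True` result pages someone at high priority or just opens a ticket for the owning team.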