Consider paging DB specialists and/or production on database alerts
What are we going to do?
Database alerts are mostly set to severity "warning" or even "info" because they were new and experimental when introduced. For the most part these alerts have now been tuned to not be too noisy (with one or two possible exceptions), and it's time to consider paging from them.
Should we page only the DB team? Should we also page Production Oncall? Or should we page only Production Oncall and let them escalate to the DB team? (See the routing sketch below.)
Based on experience with https://gitlab.com/gitlab-com/infrastructure/issues/3711, not paging DB specialists can add roughly 10 minutes of delay in resolving an issue. In other cases Production Oncall will resolve database issues without DB team involvement, but to establish credible ownership it is valuable to have the DB team involved immediately during outages.
Based on that same outage, perhaps we should consider paging the DB team even for relatively minor database alerts such as "number of databases changed". We don't expect frequent HA failovers, so the amount of noise should be minimal.
(The only exception to the "tuned to not be too noisy" claim is the XLOG generation high alert.)
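
To make "paging from them" concrete, here is a minimal sketch of what promoting one of these alerts to paging severity could look like, assuming our Prometheus rules drive paging via a `severity` label and a `pager: pagerduty` label. The alert name, metric, expression, and threshold below are hypothetical stand-ins, not the actual rule from our runbooks:

```yaml
groups:
  - name: database.rules
    rules:
      # Hypothetical rule for the "number of databases changed" alert.
      - alert: PostgresDatabaseCountChanged
        expr: changes(pg_database_count[10m]) > 0   # hypothetical metric and expression
        for: 1m
        labels:
          severity: critical   # was: warning — this is the change being proposed
          pager: pagerduty     # assumed label that makes an alert page
          team: database       # assumed label used to route to the DB team
        annotations:
          title: Number of databases changed
          description: "Database count changed on {{ $labels.instance }}; possible HA failover."
```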
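If the answer to the question above is "page both", one way to express that is in Alertmanager routing: page the DB team on database alerts and fall through so Production Oncall is paged as well. The receiver names and label matchers here are assumptions for illustration, not our actual config:

```yaml
route:
  receiver: production_oncall_pagerduty    # assumed default receiver
  routes:
    - match:
        team: database
        pager: pagerduty
      receiver: db_team_pagerduty          # assumed DB team receiver
      continue: true                       # keep matching so the next route also fires
    - match:
        pager: pagerduty
      receiver: production_oncall_pagerduty
```

Removing `continue: true` would page only the DB team for database alerts; paging only Production Oncall and escalating to the DB team manually requires no routing change at all.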
Why are we doing it?
When are we going to do it?
- Start time: ___
- Duration: ___
- Estimated end time: ___
How are we going to do it?
How are we preparing for it?
What can we check before starting?
What can we check afterwards to ensure that it's working?
Impact
- Type of impact: <internal|client facing|no impact>
- What will happen: ___
- Do we expect downtime? (set the override in PagerDuty): ___
How are we communicating this to our customers?
- Announce the deployment well in advance: ___
- Tweet after the change.
What is the rollback plan?
Monitoring
- Graphs to check for failures:
-
- Graphs to check for improvements:
-
- Alerts that may trigger:
-
[IF NEEDED]
Google Doc to follow during the change (remember to link in the on-call log)
Scheduling
Schedule a downtime in the production calendar that is twice as long as your worst duration estimate; be pessimistic (better safe than sorry).
When things go wrong (downtime or service degradation)
- Label the change issue as an outage
- Perform a blameless post mortem