Consider paging DB specialists and/or production on database alerts
What are we going to do?
Database alerts are mostly set to severity "warning" or even "info" because they were new and experimental when introduced. For the most part these alerts have now been tuned to not be too noisy (with one or two possible exceptions), and it's time to consider paging from them.
Should we page only the DB team? Should we also page Production Oncall? Or should we page only Production Oncall and let them escalate to the DB team? (See the routing sketch below.)
Based on experience with https://gitlab.com/gitlab-com/infrastructure/issues/3711, not paging DB specialists can add roughly 10 minutes of delay in resolving an issue. In other cases Production Oncall will resolve database issues without DB team involvement, but to establish credible ownership it is valuable to have the DB team involved immediately during outages.
Based on that same outage, perhaps we should consider paging the DB team even for relatively minor database alerts such as "number of databases changed". We don't expect frequent HA failovers, so the amount of noise should be minimal.
(The only exception to the "tuned to not be too noisy" claim is the XLOG generation high alert.)
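
To make "paging from them" concrete, here is a minimal sketch of what promoting one of these alerts to paging severity could look like, assuming our Prometheus rules drive paging via a `severity` label and a `pager: pagerduty` label. The alert name, metric, expression, and threshold below are hypothetical stand-ins, not the actual rule from our runbooks:

```yaml
groups:
  - name: database.rules
    rules:
      # Hypothetical rule for the "number of databases changed" alert.
      - alert: PostgresDatabaseCountChanged
        expr: changes(pg_database_count[10m]) > 0   # hypothetical metric and expression
        for: 1m
        labels:
          severity: critical   # was: warning — this is the change being proposed
          pager: pagerduty     # assumed label that makes an alert page
          team: database       # assumed label used to route to the DB team
        annotations:
          title: Number of databases changed
          description: "Database count changed on {{ $labels.instance }}; possible HA failover."
```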
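If the answer to the question above is "page both", one way to express that is in Alertmanager routing: page the DB team on database alerts and fall through so Production Oncall is paged as well. The receiver names and label matchers here are assumptions for illustration, not our actual config:

```yaml
route:
  receiver: production_oncall_pagerduty    # assumed default receiver
  routes:
    - match:
        team: database
        pager: pagerduty
      receiver: db_team_pagerduty          # assumed DB team receiver
      continue: true                       # keep matching so the next route also fires
    - match:
        pager: pagerduty
      receiver: production_oncall_pagerduty
```

Removing `continue: true` would page only the DB team for database alerts; paging only Production Oncall and escalating to the DB team manually requires no routing change at all.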
Why are we doing it?
When are we going to do it?
- Start time: ___
- Duration: ___
- Estimated end time: ___
How are we going to do it?
How are we preparing for it?
What can we check before starting?
What can we check afterwards to ensure that it's working?
Impact
- Type of impact: <internal|client facing|no impact>
- What will happen: ___
- Do we expect downtime? (set the override in PagerDuty): ___
How are we communicating this to our customers?
- Announce the deployment well in advance: ___
- Tweet after the change.
What is the rollback plan?
Monitoring
- Graphs to check for failures:
-
- Graphs to check for improvements:
-
- Alerts that may trigger:
-
[IF NEEDED]
Google Doc to follow during the change (remember to link in the on-call log)
Scheduling
Schedule a downtime in the production calendar that is twice as long as your worst duration estimate; be pessimistic (better safe than sorry).
When things go wrong (downtime or service degradation)
- Label the change issue as an outage
- Perform a blameless post mortem