Production postgres and redis failover
What are we going to do?
Fail over the primary postgres instance and the caching redis instance.
Why are we doing it?
Our cloud provider has contacted us to reboot our instances; among them are the primary postgres instance and our redis cache instances.
When are we going to do it?
- Start time: 2018-01-07 17:00 UTC
- Duration: approx. 2 hours
- Estimated end time: 2018-01-07 19:00 UTC
How are we going to do it?
One team member will concentrate on redis: the failover will be done manually via sentinel. One team member will concentrate on postgres: the failover will be done via our omnibus-packaged HA solution.
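The redis side can be sketched as the sentinel commands below. The sentinel host, port, and the master name `gitlab-redis` are placeholders, and the commands are echoed as a dry run rather than executed; substitute real values and run them by hand.

```shell
# Placeholders: substitute the real sentinel host and monitored master name.
SENTINEL="redis-cli -h sentinel-01 -p 26379"
MASTER="gitlab-redis"

# Ask a single sentinel to promote a replica; the other sentinels will
# observe the promotion and reconfigure the remaining replicas.
echo "$SENTINEL SENTINEL failover $MASTER"

# Afterwards, confirm which address the sentinels now report as master.
echo "$SENTINEL SENTINEL get-master-addr-by-name $MASTER"
```

Running the failover against a single sentinel is enough: the promotion is gossiped to the rest of the sentinel group.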
How are we preparing for it?
Redis:
- This is being done here: https://gitlab.com/gitlab-com/infrastructure/issues/3498
Postgres:
- This has been executed successfully on staging numerous times.
- Databases are being rechecked, as well as our pgbouncer setup.
- This is being done here: https://gitlab.com/gitlab-com/infrastructure/issues/3489
What can we check before starting?
Redis:
- All the slaves are connected and up to date.
- All the sentinels are online and in sync.
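These pre-checks can be expressed as sentinel queries. This is a dry-run sketch (commands are echoed, not executed), and the hostnames and master name are assumptions:

```shell
# Placeholders for the sentinel host and the monitored master name.
S="redis-cli -h sentinel-01 -p 26379"
M="gitlab-redis"

echo "$S SENTINEL slaves $M"     # every replica should show master-link-status:ok
echo "$S SENTINEL sentinels $M"  # all peer sentinels visible and recently seen
echo "$S SENTINEL ckquorum $M"   # quorum must be reachable before we start
```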
Postgres:
- All secondaries are connected and up to date.
- All application servers are pointing at the correct pgbouncer.
- The primary is healthy (no background migrations, no large queries).
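The postgres pre-checks can be sketched as the queries below. Hostnames, database, and user names are placeholders, and the commands are echoed as a dry run:

```shell
# Placeholder connection string for the current primary.
PSQL="psql -h db-primary -U gitlab-psql gitlab_production"

# Secondaries connected and streaming:
echo "$PSQL -c \"SELECT client_addr, state FROM pg_stat_replication;\""

# Long-running queries on the primary that could block a clean switchover:
echo "$PSQL -c \"SELECT pid, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 5;\""

# pgbouncer's admin console shows which databases it is fronting:
echo "psql -h pgbouncer-host -p 6432 -U pgbouncer pgbouncer -c 'SHOW DATABASES;'"
```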
What can we check afterwards to ensure that it's working?
Redis:
- Sentinel logs should go silent.
- All the connections are successfully hitting the new masters.
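These post-checks map onto two commands, sketched below as a dry run (hostnames and master name are placeholders):

```shell
# Placeholder sentinel host; the monitored master name is "gitlab-redis".
S="redis-cli -h sentinel-01 -p 26379"

# The sentinels should now report the newly promoted node:
echo "$S SENTINEL get-master-addr-by-name gitlab-redis"

# On the promoted node, role should be "master" with all replicas attached:
echo "redis-cli -h redis-new-master INFO replication"
```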
Postgres:
- All 3 secondaries are pointing to the correct master.
- repmgrd reports no errors.
- WAL-E backups are (still) running.
- The primary pgbouncer is pointing to the correct master.
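The postgres post-checks can be sketched as follows; the log path and hostnames are assumptions, and the commands are echoed rather than executed:

```shell
# repmgr's view of the cluster: every secondary should follow the new master.
echo "repmgr cluster show"

# repmgrd log location is a placeholder; check for errors after the switchover.
echo "tail -n 50 /var/log/gitlab/repmgrd/current"

# pgbouncer's server connections should point at the new master:
PGB="psql -h pgbouncer-host -p 6432 -U pgbouncer pgbouncer"
echo "$PGB -c 'SHOW SERVERS;'"
```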
Impact
- Type of impact: client-facing
- What will happen: client timeouts, write errors
- Do we expect downtime? (set the override in pagerduty): The failover should be executed without downtime; however, if an unforeseen problem arises, downtime is possible.
How are we communicating this to our customers?
- Tweet before the change:
  - 4 days
  - 24 hours
  - 12 hours
  - 1 hour
- Tweet after the change.
- Add a banner to remind people we are performing the failovers.
What is the rollback plan?
This is a purely roll-forward plan: once we change masters, a reconfiguration is inevitable.
Monitoring
- Graphs to check for failures:
- Alerts that may trigger:
- availability
- replication delay
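The replication-delay alerts correspond to checks along these lines, sketched as a dry run (hostnames are placeholders):

```shell
# Replication delay as measured on a postgres secondary:
DELAY_SQL="SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"
echo "psql -c \"$DELAY_SQL\""

# Replication health on a redis replica: look at master_link_status and
# master_last_io_seconds_ago in the INFO output.
echo "redis-cli -h redis-replica-01 INFO replication"
```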
[IF NEEDED]
Google Doc to follow during the change (remember to link in the on-call log)
https://docs.google.com/document/d/1HmiwyxplvIJaxJXXqWOd2RyRuT3HIWSs7CgnxnbCQ9w/edit?usp=sharing
Scheduling
Schedule a downtime in the production calendar twice as long as your worst-case duration estimate; be pessimistic (better safe than sorry).
When things go wrong (downtime or service degradation)
- Label the change issue as outage
- Perform a blameless post mortem
References
Edited by Jason Tevnan