Production postgres and redis failover
What are we going to do?
Fail over the primary postgres instance and the caching redis instance.
Why are we doing it?
Our cloud provider has contacted us to reboot our instances; among them are the primary postgres instance and our redis cache instances.
When are we going to do it?
- Start time: 2018-01-07 17:00 UTC
- Duration: approx. 2 hours
- Estimated end time: 2018-01-07 19:00 UTC
How are we going to do it?
One team member will concentrate on redis: the failover will be done manually via sentinel. One team member will concentrate on postgres: the failover will be done via our omnibus-packaged HA solution.
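The redis side can be sketched as the sentinel commands below. The sentinel host, port, and the master name `gitlab-redis` are placeholders, and the commands are echoed as a dry run rather than executed; substitute real values and run them by hand.

```shell
# Placeholders: substitute the real sentinel host and monitored master name.
SENTINEL="redis-cli -h sentinel-01 -p 26379"
MASTER="gitlab-redis"

# Ask a single sentinel to promote a replica; the other sentinels will
# observe the promotion and reconfigure the remaining replicas.
echo "$SENTINEL SENTINEL failover $MASTER"

# Afterwards, confirm which address the sentinels now report as master.
echo "$SENTINEL SENTINEL get-master-addr-by-name $MASTER"
```

Running the failover against a single sentinel is enough: the promotion is gossiped to the rest of the sentinel group.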
How are we preparing for it?
Redis:
- This is being done here: https://gitlab.com/gitlab-com/infrastructure/issues/3498
Postgres:
- This has been executed successfully on staging numerous times.
- Databases are being rechecked, as well as our pgbouncer setup.
- This is being done here: https://gitlab.com/gitlab-com/infrastructure/issues/3489
What can we check before starting?
Redis:
- All the slaves are connected and up to date.
- All the sentinels are online and in sync.
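These pre-checks can be expressed as sentinel queries. This is a dry-run sketch (commands are echoed, not executed), and the hostnames and master name are assumptions:

```shell
# Placeholders for the sentinel host and the monitored master name.
S="redis-cli -h sentinel-01 -p 26379"
M="gitlab-redis"

echo "$S SENTINEL slaves $M"     # every replica should show master-link-status:ok
echo "$S SENTINEL sentinels $M"  # all peer sentinels visible and recently seen
echo "$S SENTINEL ckquorum $M"   # quorum must be reachable before we start
```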
Postgres:
- All secondaries are connected and up to date.
- All application servers are pointing at the correct pgbouncer.
- The primary is healthy (no background migrations, no large queries).
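The postgres pre-checks can be sketched as the queries below. Hostnames, database, and user names are placeholders, and the commands are echoed as a dry run:

```shell
# Placeholder connection string for the current primary.
PSQL="psql -h db-primary -U gitlab-psql gitlab_production"

# Secondaries connected and streaming:
echo "$PSQL -c \"SELECT client_addr, state FROM pg_stat_replication;\""

# Long-running queries on the primary that could block a clean switchover:
echo "$PSQL -c \"SELECT pid, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 5;\""

# pgbouncer's admin console shows which databases it is fronting:
echo "psql -h pgbouncer-host -p 6432 -U pgbouncer pgbouncer -c 'SHOW DATABASES;'"
```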
What can we check afterwards to ensure that it's working?
Redis:
- Sentinel logs should go silent.
- All the connections are successfully hitting the new masters.
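These post-checks map onto two commands, sketched below as a dry run (hostnames and master name are placeholders):

```shell
# Placeholder sentinel host; the monitored master name is "gitlab-redis".
S="redis-cli -h sentinel-01 -p 26379"

# The sentinels should now report the newly promoted node:
echo "$S SENTINEL get-master-addr-by-name gitlab-redis"

# On the promoted node, role should be "master" with all replicas attached:
echo "redis-cli -h redis-new-master INFO replication"
```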
Postgres:
- All 3 secondaries are pointing to the correct master.
- repmgrd reports no errors.
- WAL-E backups are (still) running.
- The primary pgbouncer is pointing to the correct master.
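The postgres post-checks can be sketched as follows; the log path and hostnames are assumptions, and the commands are echoed rather than executed:

```shell
# repmgr's view of the cluster: every secondary should follow the new master.
echo "repmgr cluster show"

# repmgrd log location is a placeholder; check for errors after the switchover.
echo "tail -n 50 /var/log/gitlab/repmgrd/current"

# pgbouncer's server connections should point at the new master:
PGB="psql -h pgbouncer-host -p 6432 -U pgbouncer pgbouncer"
echo "$PGB -c 'SHOW SERVERS;'"
```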
Impact
- Type of impact: client-facing
- What will happen: client timeouts, write errors
- Do we expect downtime? (set the override in pagerduty): The failover should be executed without downtime; however, if an unforeseen problem arises, downtime is possible.
How are we communicating this to our customers?
- Tweet before the change:
  - 4 days
  - 24 hours
  - 12 hours
  - 1 hour
- Tweet after the change.
- Add a banner to remind people we are performing the failovers.
What is the rollback plan?
This is a purely roll-forward plan: once we change masters, a reconfiguration is inevitable.
Monitoring
- Graphs to check for failures:
- Alerts that may trigger:
- availability
- replication delay
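The replication-delay alerts correspond to checks along these lines, sketched as a dry run (hostnames are placeholders):

```shell
# Replication delay as measured on a postgres secondary:
DELAY_SQL="SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"
echo "psql -c \"$DELAY_SQL\""

# Replication health on a redis replica: look at master_link_status and
# master_last_io_seconds_ago in the INFO output.
echo "redis-cli -h redis-replica-01 INFO replication"
```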
[IF NEEDED]
Google Doc to follow during the change (remember to link in the on-call log)
https://docs.google.com/document/d/1HmiwyxplvIJaxJXXqWOd2RyRuT3HIWSs7CgnxnbCQ9w/edit?usp=sharing
Scheduling
Schedule a downtime in the production calendar twice as long as your worst-case duration estimate; be pessimistic (better safe than sorry).
When things go wrong (downtime or service degradation)
- Label the change issue as outage
- Perform a blameless post mortem
References
Edited by Jason Tevnan