Reverse engineer database failover and reapply to staging
As a first step before we can execute a fire drill, we need to understand how the failover mechanism is set up in production and replicate it in staging.
For this we need:
- Two databases in staging, one acting as the primary and another as a secondary following the primary (a rough setup sketch follows this list).
- A Corosync setup so we can trigger a failover.
- All of this properly documented.
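As a starting point, and purely as a sketch rather than a runbook, standing up the secondary could look roughly like this. It assumes plain PostgreSQL streaming replication, a `replication` role already created on the primary, and Debian-style service/paths; the hostname and data directory are placeholders:

```shell
# Placeholders; adjust to the actual staging hosts and data directory.
PRIMARY=staging-db-primary.example.com
PGDATA=/var/lib/postgresql/main

# On the secondary: stop Postgres, move the old data directory aside, and take
# a fresh base backup from the primary. -R writes the recovery configuration so
# the node comes back up as a streaming standby.
sudo service postgresql stop
sudo mv "$PGDATA" "$PGDATA.old"
sudo -u postgres pg_basebackup -h "$PRIMARY" -U replication -D "$PGDATA" -X stream -R
sudo service postgresql start

# Verify replication is flowing: this should return 't' on the secondary, and
# the secondary should show up in pg_stat_replication on the primary.
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
```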
With this, we can perform drills in staging to get comfortable with the setup. Then we can talk about performing a failover in production.
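For the drill itself, assuming the failover is driven by Pacemaker on top of Corosync (the `crm` commands below come from crmsh and need to be checked against what production actually runs; node names are placeholders), triggering and verifying a controlled failover could look roughly like this:

```shell
# Take the current primary node out of service so the cluster promotes the standby.
sudo crm node standby staging-db-1

# Watch the cluster converge and see where the primary resource ended up.
sudo crm_mon -1

# On the newly promoted node this should now return 'f', i.e. it accepts writes.
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"

# Bring the old primary back afterwards; it has to be rebuilt as a standby
# (see the setup sketch above) before it can follow the new primary.
sudo crm node online staging-db-1
```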
Previous issue
Fire drill postgres failover and replication recovery
With all the changes that have happened to the staging and production clusters with load balancing, I think we need to fire drill performing a failover, first in staging and then in production once we are confident that we will survive it.
I think this is critical because:
- The infrastructure has changed a lot.
- Our runbooks for recovering replication are outdated.
- We have no idea if Corosync is actually working as expected (a quick health-check sketch follows this list).
- I would not like to discover that this doesn't work in the middle of a production incident.
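As a starting point for that Corosync check, and again only as a sketch (the exact tools depend on the Corosync version installed), something like this on each database node would tell us whether cluster membership and Pacemaker's view of the world look sane:

```shell
# Ring status as Corosync on this node sees it.
sudo corosync-cfgtool -s

# Quorum and membership (Corosync 2.x; older versions use different tooling).
sudo corosync-quorumtool -s

# Pacemaker's view: which nodes are online and where the primary resource runs.
sudo crm_mon -1
```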
We can use this opportunity to:
- Update our processes and try to automate recovery with a manual trigger.
- See how it behaves from the monitoring perspective, and do it in staging while running a siege (example after this list) to see what we can expect to happen for customers.
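For the siege part, a minimal example would be to keep constant traffic against staging for the duration of the drill and compare availability and failed transactions in the summary afterwards; the URL, concurrency, and duration below are placeholders to tune:

```shell
# 25 concurrent simulated users hitting staging for 10 minutes while we fail over.
siege -c 25 -t 10M https://staging.gitlab.com/explore
```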
So, @yorickpeterse, when can we schedule this drill?
cc/ @ernstvn