
Reverse engineer database failover and reapply to staging

As a first step before we can execute a fire drill, we need to understand how the failover mechanism is set up in production and replicate it in staging.

For this we need:

  • Two databases in staging: one acting as the primary, the other as a secondary following the primary (see the sketch after this list for one way to verify the replication link).
  • Corosync set up so that we can trigger a failover.
  • All this properly documented.
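
As a concrete starting point, here is a minimal sketch of how we could verify that the staging secondary is actually streaming from the primary. The hostnames and credentials are hypothetical, and it assumes PostgreSQL 10+ (for the `pg_last_wal_replay_lsn` name) and psycopg2:

```python
# Hedged sketch: verify that the staging secondary is streaming from the primary.
# Hostnames/credentials below are hypothetical placeholders.
import psycopg2

PRIMARY_DSN = "host=staging-db-primary dbname=postgres user=monitor"    # hypothetical
STANDBY_DSN = "host=staging-db-secondary dbname=postgres user=monitor"  # hypothetical

def check_primary(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # pg_stat_replication lists WAL senders: one row per connected standby.
        cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication")
        rows = cur.fetchall()
        assert rows, "primary has no connected standbys"
        for name, state, sync_state in rows:
            print(f"standby={name} state={state} sync={sync_state}")

def check_standby(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # A follower must report that it is in recovery (i.e. replaying WAL).
        cur.execute("SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn()")
        in_recovery, replay_lsn = cur.fetchone()
        assert in_recovery, "secondary is not in recovery mode"
        print(f"standby replaying WAL, last replayed LSN: {replay_lsn}")

if __name__ == "__main__":
    check_primary(PRIMARY_DSN)
    check_standby(STANDBY_DSN)
```

Something along these lines could go into the drill checklist, run once before triggering the failover and again after the roles have switched.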

With this, we can perform drills in staging to get comfortable with the setup. Then we can talk about performing a failover in production.

cc/ @jtevnan @yorickpeterse


Previous issue

Fire drill postgres failover and replication recovery

With all the changes that have happened to the staging and production clusters around load balancing, I think we need to fire drill a failover, first in staging, then in production once we are confident that we will survive it.

I think this is critical because:

  • The infrastructure has changed a lot.
  • Our runbooks for recovering replication are outdated.
  • We have no idea whether Corosync is actually working as expected (a quick sanity-check sketch follows this list).
  • I would not like to discover that this doesn't work in the middle of a production incident.
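
For the Corosync question specifically, a pre-drill sanity check could look roughly like the sketch below. It shells out to the stock Corosync CLI tools (`corosync-cfgtool`, `corosync-quorumtool`); the output formats differ between Corosync versions, so the string checks are an assumption to adjust against what our nodes actually print:

```python
# Hedged sketch: check that Corosync reports healthy rings and quorum on a
# staging node before the drill. Assumes the stock Corosync CLI tools are on
# PATH; output formats vary between Corosync 2.x and 3.x, so adjust the
# string checks to match our nodes.
import subprocess

def run(cmd):
    """Run a command, echo it, and return its stdout (raises on non-zero exit)."""
    print(f"$ {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)
    return result.stdout

if __name__ == "__main__":
    # Ring/link status: on Corosync 2.x a healthy ring prints "no faults".
    rings = run(["corosync-cfgtool", "-s"])
    if "no faults" not in rings:
        print("WARNING: could not confirm healthy rings; check output above")

    # Quorum status: the cluster must be quorate before we trigger anything.
    quorum = run(["corosync-quorumtool", "-s"])
    quorate = any(line.strip().startswith("Quorate") and "Yes" in line
                  for line in quorum.splitlines())
    if not quorate:
        raise SystemExit("cluster is not quorate; do not start the drill")
```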

We can use this opportunity to:

  • update our processes and try to automate recovery behind a manual trigger
  • see how the failover behaves from the monitoring perspective, running it in staging under a siege-style load test so we can see what customers should expect (a small load-generation sketch follows this list).
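
For the load-test part, the sketch below is a tiny Python stand-in for a proper siege run against a hypothetical staging endpoint. It hammers the URL for the drill window and records error counts and rough latency, so we can line the client-side view up against the monitoring graphs:

```python
# Hedged sketch: minimal load generator to run during the staging failover.
# The endpoint and duration are hypothetical placeholders; a real siege run
# would replace this.
import time
import requests

STAGING_URL = "https://staging.example.com/"  # hypothetical endpoint
DURATION_S = 300                              # roughly the failover window

def load_test(url, duration_s):
    ok = errors = 0
    latencies = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=5)
            latencies.append(time.monotonic() - start)
            if resp.status_code < 500:
                ok += 1
            else:
                errors += 1
        except requests.RequestException:
            # Connection resets during the failover count as customer-visible errors.
            errors += 1
    print(f"requests={ok + errors} ok={ok} errors={errors}")
    if latencies:
        latencies.sort()
        print(f"p50 latency: {latencies[len(latencies) // 2]:.3f}s")

if __name__ == "__main__":
    load_test(STAGING_URL, DURATION_S)
```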

So, @yorickpeterse when can we schedule this drill to happen?

cc/ @ernstvn