Geo DR: planned failover (including of a multi-node secondary)

This has come up in discussion between Andrew and me, and I've also seen it mentioned in various migration plans.

The intended use case for Geo DR is turning a secondary into a primary in the event of the original primary being destroyed or lost somehow in an unforeseen event.

However, there is another use case: gracefully failing over between sites when a migration between datacentres can be foreseen. Perhaps regulatory requirements are due to change, or the new site is just cheaper.

There are several considerations that we need to handle for this use case. Here are some, roughly in priority order:

  • No in-progress work should be lost on the primary (running Sidekiq jobs should be allowed to finish, for instance)
  • Transition should have as little downtime as possible
  • The secondary's caches (those that are not replicated between Geo nodes, e.g. Redis) should be warmed to avoid performance regressions in the immediate aftermath of the switch.
    • See the discussion in the comments below; this is now a non-issue
  • In the event of problems, it would be wonderful to have a means of reverting without data loss / consistency issues. This might be Too Difficult without multi-master, though.
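
To make the ordering of these considerations concrete, here is a rough sketch of how a planned failover could be sequenced. It is not an existing Geo API: `wait_until`, `planned_failover`, and every callback passed in (`block_new_work`, `primary_is_idle`, `secondary_is_caught_up`, `promote_secondary`, `unblock_primary`) are hypothetical names standing in for whatever mechanism we end up with. The only revert it covers is the easy one, backing out before promotion.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float,
               poll_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def planned_failover(block_new_work: Callable[[], None],
                     primary_is_idle: Callable[[], bool],
                     secondary_is_caught_up: Callable[[], bool],
                     promote_secondary: Callable[[], None],
                     unblock_primary: Callable[[], None],
                     timeout_s: float = 600.0,
                     **wait_kwargs) -> bool:
    """Hypothetical sequencing of a planned failover; all callbacks are assumptions.

    Returns True if the secondary was promoted, False if we backed out.
    """
    # Downtime starts here: the primary stops accepting new work.
    block_new_work()

    # Let in-progress work (e.g. running background jobs) finish, then
    # wait for the secondary to receive everything the primary has.
    if (wait_until(primary_is_idle, timeout_s, **wait_kwargs)
            and wait_until(secondary_is_caught_up, timeout_s, **wait_kwargs)):
        promote_secondary()
        return True

    # Nothing has been promoted yet, so reverting is just reopening the primary.
    unblock_primary()
    return False
```

The timeout bounds the downtime window: if the primary never drains or the secondary never catches up, the primary is reopened and nothing is lost. Reverting after promotion is the hard case mentioned in the last bullet and is deliberately not covered here.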

There are probably other considerations as well.

/cc @stanhu @jramsay @ernstvn @andrewn

Edited by James Ramsay (ex-GitLab)