Propose database cluster configurations that prevent data loss on failover

Sub-issue of https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7282.

As per the handbook, data loss should be prevented, and this takes priority over availability.

Propose some postgres + patroni configurations that prevent data loss when the postgres master fails over. Each config should ensure that the master has received acknowledgement from at least 1 replica node before acknowledging a write to the client (2-safe replication). Some configs may require more acks than others.
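To make the 2-safe requirement concrete, here is a minimal toy model in Python. It is not PostgreSQL or Patroni code; the `Replica` and `Primary` classes and the `required_acks` parameter are hypothetical names invented for illustration. It only shows the rule described above: a write is acknowledged to the client only once at least `required_acks` replicas have confirmed it.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for a streaming replica."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def replicate(self, record) -> bool:
        """Store the record and return True as an ack; return False if down."""
        if not self.healthy:
            return False
        self.log.append(record)
        return True


@dataclass
class Primary:
    """Hypothetical stand-in for the master node."""
    replicas: list
    required_acks: int = 1  # assumption: 1 ack gives 2-safe replication
    log: list = field(default_factory=list)

    def write(self, record) -> bool:
        """Return True only if enough replicas acknowledged the record."""
        self.log.append(record)  # the write is recorded locally first
        acks = sum(1 for r in self.replicas if r.replicate(record))
        # A real database would typically keep the client waiting here
        # instead of returning False; this toy model just reports the outcome.
        return acks >= self.required_acks


if __name__ == "__main__":
    primary = Primary([Replica("replica-1"), Replica("replica-2", healthy=False)])
    print(primary.write("row 1"))  # one healthy replica acked the write
```

In this model, every record for which `write` returns `True` exists on at least `required_acks` replicas, which is the property a failover would rely on.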

Each config should be delivered in the form of a (WIP) MR.

Each config should come with an explanation of the intended and likely behaviour, including effects on availability, when failover occurs:

  • How many replica nodes are guaranteed to have a fully up-to-date set of writes when failover occurs?
  • How many nodes must fail at around the same time in order for patroni to refuse to promote a replica, causing downtime?
  • If patroni refuses to promote a replica under this config because it would cause data loss, how would you intervene manually in order to restore availability, even if this causes data loss? This should be delivered in the form of a runbook MR.
  • What happens if all nodes in a zone were to fail?
  • Assume that if all nodes in a region fail, we will be down. GitLab is not yet multi-regional.
  • Any more questions that you can think of?
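As a starting point for the first two questions, the following is a rough counting sketch in Python. It is a simplification under stated assumptions, not a description of what Patroni actually does: it assumes one master, `replicas` replica nodes, and that every acknowledged write has been confirmed by exactly `required_acks` replicas. The function name `failover_guarantees` and its result keys are hypothetical. Sandbox evidence would still be needed to confirm the real behaviour.

```python
def failover_guarantees(replicas: int, required_acks: int) -> dict:
    """Worst-case counts for a cluster of one master plus `replicas` replicas.

    Assumes every acknowledged write was confirmed by exactly
    `required_acks` replicas (a simplifying assumption).
    """
    if not 1 <= required_acks <= replicas:
        raise ValueError("required_acks must be between 1 and replicas")
    return {
        # Replicas guaranteed to hold every acknowledged write.
        "up_to_date_replicas": required_acks,
        # Replica failures after which the master can no longer collect
        # enough acks, so writes stall if acks are strictly required.
        "replica_failures_blocking_writes": replicas - required_acks + 1,
        # Worst case: the master plus every replica that held the latest
        # writes fail together, leaving no candidate that is known to be
        # fully up to date.
        "failures_leaving_no_safe_candidate": required_acks + 1,
    }


if __name__ == "__main__":
    for acks in (1, 2):
        print(acks, failover_guarantees(replicas=3, required_acks=acks))
```

The sketch shows the trade-off each config has to describe: raising `required_acks` increases the number of nodes that must fail together before no safe promotion candidate remains, but lowers the number of replica failures the master can tolerate before writes stall.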

Evidence from a sandbox environment should be provided to lend weight to these descriptions, including patroni+postgres log extracts.

No performance benchmarking needs to take place in this issue; that will be handled in a follow-up issue (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8595).

Resources that may be useful:

Edited by Craig Furman