PostgreSQL Split Brain

Context

At 01:00 UTC, the PostgreSQL processes on postgres-02 and postgres-03 restarted. postgres-02 restarted as a master, and postgres-03 restarted to begin following postgres-02 as the new master. This occurred because DNS resolution for postgres-01.db.prd.gitlab.com failed. From 01:00 until 01:41, we may have been serving stale data.
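The failure mode can be illustrated with a minimal sketch (Python here purely for illustration; the real check lives inside repmgr and our failover tooling): from a standby's point of view, a transient DNS failure is indistinguishable from a dead master, which is how a node can decide to promote itself. The `.invalid` hostname below is a stand-in for the failed lookup, since `.invalid` is reserved and never resolves.

```python
import socket

def master_reachable(hostname: str, port: int = 5432) -> bool:
    """Return True if the master's hostname resolves in DNS.

    A transient DNS failure here looks the same as a dead master,
    which is how postgres-02 could decide to restart as a master.
    """
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# ".invalid" is reserved (RFC 2606) and never resolves, standing in
# for the failed lookup of postgres-01.db.prd.gitlab.com.
print(master_reachable("postgres-01.db.prd.gitlab.invalid"))  # False
```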

Timeline

Date: 2017-01-08

  • 01:00 UTC - The postgres process on postgres-02 failed to resolve postgres-01.db.prd.gitlab.com and restarted as master.
  • 01:00 UTC - The postgres process on postgres-03 restarted to begin following postgres-01 as master.
  • 01:08 UTC - The alert for replication lag fired
  • 01:15 UTC - The oncall engineer discovered that the postgres-03 process had restarted
  • 01:30 UTC - It was discovered that repmgr had gotten out of sync and we had a split-brain situation
  • 01:30 UTC - The oncall engineer called a team member to help out
  • 01:41 UTC - The deploy page was put up to avoid writes to the database
  • 01:42 UTC - postgres-02 and postgres-03 were removed from the DB load balancing
  • 01:43 UTC - The removal was deployed to the fleet
    • This was deployed by running chef-client on all the fe and be servers.
  • 01:48 UTC - We began rebuilding replication on postgres-02
    • gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com
  • 02:06 UTC - The removal finished deploying
  • 02:06 UTC - The deploy page was taken down
  • 02:44 UTC - postgres-02 finished rebuilding replication and was added back to repmgr
    • gitlab-ctl repmgr standby register
  • 02:54 UTC - Began repair of postgres-03 replication
    • gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com
  • 03:50 UTC - postgres-03 finished repairing and was added back to repmgr and db_load_balancing
    • gitlab-ctl repmgr standby register

Incident Analysis

  • The PagerDuty alert "Postgres Replication lag is over 2 minutes" alerted us to the issue.
  • Auto failover was enabled earlier this week, which is what led to this split-brain situation.
  • This was unrelated to the db failover this morning (#3496 (closed)).

Root Cause Analysis

What went well

What can be improved

Corrective actions

Guidelines

Edited by Alex Hanselka