# PostgreSQL Split Brain

## Context
At 01:00 UTC, the PostgreSQL processes on `postgres-02` and `postgres-03` restarted. `postgres-02` came back up as a master, and `postgres-03` came back up following `postgres-02` as the new master. This occurred because DNS resolution for `postgres-01.db.prd.gitlab.com` failed. From 01:00 until 01:41, we may have been serving stale data.
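The trigger here was name resolution, not a dead primary: when `postgres-01.db.prd.gitlab.com` stopped resolving, the standbys could no longer reach the primary and treated it as failed. A minimal sketch of that kind of resolution check follows; the function name and branching are illustrative, not repmgr's actual code:

```shell
#!/bin/sh
# Illustrative sketch: a failover decision gated on whether the
# primary's hostname resolves. This is NOT repmgr's implementation,
# just the shape of the check that failed during this incident.
can_resolve() {
  getent hosts "$1" > /dev/null
}

if can_resolve postgres-01.db.prd.gitlab.com; then
  echo "primary hostname resolves; no failover needed"
else
  # The branch taken during the incident: resolution failed, so the
  # standbys treated a healthy primary as gone.
  echo "primary hostname does not resolve; failover logic kicks in"
fi
```

The danger this sketch illustrates is that a resolver outage is indistinguishable from a primary outage to a check like this, which is how a second master could be elected while `postgres-01` was still serving writes.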
## Timeline

Date: 2017-01-08

- 01:00 UTC - The postgres process on `postgres-02` restarted when it failed to resolve `postgres-01.db.prd.gitlab.com` and restarted as master.
- 01:00 UTC - The postgres process on `postgres-03` restarted to begin following `postgres-01` as master.
- 01:08 UTC - The alert for replication lag fires
- 01:15 UTC - The oncall engineer discovered that the `postgres-03` process restarted
- 01:30 UTC - It is discovered that repmgr has gotten out of sync and we had a split-brain issue
- 01:30 UTC - The oncall engineer called a team member to help out
- 01:41 UTC - The deploy page is put up to avoid writes to the database
- 01:42 UTC - `postgres-02` and `postgres-03` are removed from the DB load balancing
  - This was done by removing their IPs from `db_load_balancing` in the `gitlab-base` and `canary-base` roles.
- 01:43 UTC - The removal is deployed to the fleet
  - This was deployed by running `chef-client` on all the `fe` and `be` servers.
- 01:48 UTC - We begin to rebuild replication on `postgres-02`
  - `gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com`
- 02:06 UTC - The removal finished deploying
- 02:06 UTC - The deploy page is taken down
- 02:44 UTC - `postgres-02` finished repairing backup and was added back to repmgr
  - `gitlab-ctl repmgr standby register`
- 02:54 UTC - Begin repair of `postgres-03` replication
  - `gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com`
- 03:50 UTC - `postgres-03` finished repairing and was added back to repmgr and `db_load_balancing`
  - `gitlab-ctl repmgr standby register`
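Each standby was repaired the same way: re-clone it from the current primary, then re-register it with repmgr once it has caught up. Consolidating the commands from the timeline into a rough runbook fragment (run on the affected standby; exact behavior may vary by gitlab-ctl and repmgr version):

```shell
# Run on the standby being repaired (postgres-02, then postgres-03).

# Re-clone the standby from the current primary and restart replication:
gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com

# Once the clone has caught up, register the node with repmgr again:
gitlab-ctl repmgr standby register
```

Note that the standby is only added back to `db_load_balancing` after registration succeeds, so application reads never hit a node that is still resyncing.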
## Incident Analysis

- PagerDuty alert `Postgres Replication lag is over 2 minutes` alerted us to the issue.
- Auto failover was enabled earlier this week, which is what led to this split-brain situation.
- This was unrelated to the db failover this morning (#3496 (closed)).
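The two-minute threshold in that alert is just a comparison against measured replay lag. A hedged sketch of the logic (the real alert's data source is not shown in this document; on a standby the lag itself would come from PostgreSQL's `pg_last_xact_replay_timestamp()`):

```shell
#!/bin/sh
# Illustrative sketch only: the threshold logic behind the
# "Postgres Replication lag is over 2 minutes" PagerDuty alert.
# On a standby, the lag in seconds could be obtained with e.g.:
#   psql -At -c "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"

lag_over_threshold() {
  # $1: replication lag in whole seconds; the alert fires above 120s
  [ "$1" -gt 120 ]
}

lag_over_threshold 300 && echo "alert would fire at 300s of lag"
lag_over_threshold 60  || echo "60s of lag stays below the threshold"
```

In this incident the lag alert was a side effect rather than the root signal: replication lagged because `postgres-03` had restarted and repmgr was out of sync, not because the primary was slow.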
## Root Cause Analysis

## What went well

## What can be improved

## Corrective actions

## Guidelines
Edited by Alex Hanselka