# PostgreSQL Split Brain
## Context

At 01:00 UTC, the PostgreSQL processes on `postgres-02` and `postgres-03` restarted. `postgres-02` came back up as a master, and `postgres-03` came back up following `postgres-02` as its new master. This occurred because DNS resolution for `postgres-01.db.prd.gitlab.com` failed. From 01:00 until 01:41, we may have been serving stale data.

## Timeline

2017-01-08:

- 01:00 UTC - The postgres process on `postgres-02` failed to resolve `postgres-01.db.prd.gitlab.com`, restarted, and came up as a master.
- 01:00 UTC - The postgres process on `postgres-03` restarted and began following `postgres-02` as master.
- 01:08 UTC - The replication lag alert fired.
- 01:15 UTC - The on-call engineer discovered that the `postgres-03` process had restarted.
- 01:30 UTC - We discovered that repmgr had gotten out of sync and we had a split-brain situation.
- 01:30 UTC - The on-call engineer called a team member to help out.
- 01:41 UTC - The deploy page was put up to avoid writes to the database.
- 01:42 UTC - `postgres-02` and `postgres-03` were removed from DB load balancing.
  - This was done by removing their IPs from [`db_load_balancing` in the gitlab-base](https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gitlab-base.json#L255-261) and [canary-base](https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/canary-base.json#L273-280) roles.
- 01:43 UTC - The removal began deploying to the fleet.
  - This was done by running chef-client on all the `fe` and `be` servers.
- 01:48 UTC - We began rebuilding replication on `postgres-02`.
  - `gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com`
- 02:06 UTC - The removal finished deploying.
- 02:06 UTC - The deploy page was taken down.
- 02:44 UTC - `postgres-02` finished rebuilding replication and was added back to repmgr.
  - `gitlab-ctl repmgr standby register`
- 02:54 UTC - We began repairing replication on `postgres-03`.
  - `gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com`
- 03:50 UTC - `postgres-03` finished repairing and was added back to repmgr and `db_load_balancing`.
  - `gitlab-ctl repmgr standby register`

## Incident Analysis

- The PagerDuty alert `Postgres Replication lag is over 2 minutes` alerted us to the issue.
- Auto failover was enabled earlier this week, which is what led to this split-brain situation.
- This was unrelated to the DB failover this morning (infrastructure#3496).

## Root Cause Analysis

## What went well

## What can be improved

## Corrective actions

- https://gitlab.com/gitlab-com/infrastructure/issues/3561

## Guidelines

- [Blameless Postmortems Guideline](https://about.gitlab.com/handbook/infrastructure/#postmortems)
- [5 whys](https://en.wikipedia.org/wiki/5_Whys)
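The trigger for the incident was a single failed DNS lookup of the primary being treated as a dead primary. As a minimal illustrative sketch (not GitLab's or repmgr's actual code; the function name is hypothetical), a health check that retries transient resolution failures instead of acting on the first one might look like:

```python
import socket

def primary_resolves(hostname: str, attempts: int = 3) -> bool:
    """Return True if the primary's hostname resolves within a few tries.

    A transient DNS failure is retried rather than treated as proof the
    primary is gone; in this incident, a failed resolution of
    postgres-01.db.prd.gitlab.com was enough to trigger a failover.
    """
    for _ in range(attempts):
        try:
            socket.gethostbyname(hostname)
            return True
        except socket.gaierror:
            continue  # transient resolver failure; try again
    return False

# "localhost" resolves via the hosts file even without network access.
print(primary_resolves("localhost"))
```

This only addresses the DNS-sensitivity of the failover decision, not the broader split-brain risk of automatic promotion, which the corrective-action issue above tracks.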