# PostgreSQL Split Brain
## Context
At 01:00 UTC, the PostgreSQL processes on `postgres-02` and `postgres-03` restarted. `postgres-02` came back up as a master, and `postgres-03` restarted and began following `postgres-02` as the new master. This happened because DNS resolution for `postgres-01.db.prd.gitlab.com` failed. From 01:00 until 01:41, we may have been serving stale data.
## Timeline
Date: 2017-01-08
- 01:00 UTC - The postgres process on `postgres-02` restarted when it failed to resolve `postgres-01.db.prd.gitlab.com`, and came back up as a master.
- 01:00 UTC - The postgres process on `postgres-03` restarted and began following `postgres-02` as the new master.
- 01:08 UTC - The replication lag alert fired.
- 01:15 UTC - The on-call engineer discovered that the `postgres-03` process had restarted.
- 01:30 UTC - It was discovered that repmgr had gotten out of sync and we had a split-brain issue.
- 01:30 UTC - The on-call engineer called a team member to help out.
- 01:41 UTC - The deploy page was put up to prevent writes to the database.
- 01:42 UTC - `postgres-02` and `postgres-03` were removed from DB load balancing.
  - This was done by removing their IPs from [db_load_balancing in the gitlab-base](https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gitlab-base.json#L255-261) and [canary-base](https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/canary-base.json#L273-280) roles.
- 01:43 UTC - The removal began deploying to the fleet.
  - This was deployed by running chef-client on all the `fe` and `be` servers.
- 01:48 UTC - We began to rebuild replication on `postgres-02`.
  - `gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com`
- 02:06 UTC - The removal finished deploying.
- 02:06 UTC - The deploy page was taken down.
- 02:44 UTC - `postgres-02` finished rebuilding replication and was added back to repmgr.
  - `gitlab-ctl repmgr standby register`
- 02:54 UTC - We began repairing replication on `postgres-03`.
  - `gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com`
- 03:50 UTC - `postgres-03` finished rebuilding replication and was added back to repmgr and `db_load_balancing`.
  - `gitlab-ctl repmgr standby register`
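The rebuild steps above can be sketched as a short runbook. `gitlab-ctl repmgr` is the Omnibus GitLab wrapper around repmgr; the final lag check is an assumption about how one might verify a rebuilt standby, not a step recorded in the timeline:

```shell
# Sketch of the standby rebuild performed above, assuming postgres-01 is the
# surviving master. `standby setup` is destructive: it re-clones the data
# directory from the master before restarting postgres on this node.
gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com

# Re-register the rebuilt standby with the repmgr cluster.
gitlab-ctl repmgr standby register

# Hypothetical verification step (not in the original timeline): check
# replication lag on the standby before returning it to load balancing.
gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```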
## Incident Analysis
- The PagerDuty alert `Postgres Replication lag is over 2 minutes` alerted us to the issue.
- Auto failover had been enabled earlier in the week, which is what led to this split-brain situation.
- This was unrelated to the DB failover this morning (infrastructure#3496).
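A sketch of how the split brain could be confirmed on the nodes (assuming Omnibus GitLab tooling; the exact commands run during the incident are not recorded above):

```shell
# Each node prints its own view of the cluster; nodes disagreeing about
# which host is master indicates repmgr is out of sync.
gitlab-ctl repmgr cluster show

# pg_is_in_recovery() returns 'f' on a writable primary and 't' on a
# standby; 'f' on more than one node means the cluster has split.
gitlab-psql -c "SELECT pg_is_in_recovery();"
```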
## Root Cause Analysis
## What went well
## What can be improved
## Corrective actions
* https://gitlab.com/gitlab-com/infrastructure/issues/3561
## Guidelines
* [Blameless Postmortems Guideline](https://about.gitlab.com/handbook/infrastructure/#postmortems)
* [5 whys](https://en.wikipedia.org/wiki/5_Whys)