2019-09-03 Patroni failover
Summary
A brief summary of what happened. Try to make it as executive-friendly as possible.
Service(s) affected :
Team attribution :
Minutes downtime or degradation : None
Timeline
2019-09-03
- 23:44 UTC - First alert: Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/issues is down https://gitlab.slack.com/archives/C101F3796/p1567554260290400
- 23:44 UTC - Alert: PostgreSQL_ServiceDown https://gitlab.slack.com/archives/C101F3796/p1567554269290500
2019-09-04
- 00:02 UTC - Most things seem to be recovering, except for replication delay https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&fullscreen&panelId=968&from=1567553431327&to=1567555353841
- 00:40 UTC - pg_up == 1 on most hosts now, except patroni-01 (to be expected) and patroni-10 (old master)
- 00:40 UTC - OnGres joins call (Sergio)
- 01:50 UTC - Build of patroni-12 begun
- 01:51 UTC - Running ANALYZE on the new master as per https://gitlab.com/gitlab-com/runbooks/blob/master/howto/patroni-management.md#update-statistics-views-on-the-new-master
- 03:05 UTC - An earlier attempt to bring patroni-12 into sync was interrupted by the deletion of its replication slot. The sync has been started again.
- Unknown timings - patroni-13 and patroni-14 brought up and added to the cluster as replicas.
2019-09-05
- 01:15 UTC - patroni-01 replication slot removed, to prevent WAL from accumulating on patroni-11 (current leader).
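
For context on the last entry: a replication slot that no replica is consuming pins WAL on the leader, because the leader keeps everything from the slot's restart_lsn onward. The sketch below is illustrative only and was not part of the incident response. It assumes rows shaped like the slot_name, active and restart_lsn columns of the pg_replication_slots view, and computes how many bytes of WAL each slot holds back. The helper names, slot names, LSN values and threshold are all hypothetical.

```python
from typing import Dict, List, Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, slots: List[dict]) -> Dict[str, int]:
    """Return, per slot, the bytes of WAL between restart_lsn and current_lsn."""
    current = lsn_to_int(current_lsn)
    retained = {}
    for slot in slots:
        restart: Optional[str] = slot["restart_lsn"]
        if restart is None:
            # A slot that has never been used reserves no WAL yet.
            continue
        retained[slot["slot_name"]] = current - lsn_to_int(restart)
    return retained


def stale_slots(current_lsn: str, slots: List[dict], threshold_bytes: int) -> List[str]:
    """Names of inactive slots that retain more than threshold_bytes of WAL."""
    retained = retained_wal_bytes(current_lsn, slots)
    return sorted(
        slot["slot_name"]
        for slot in slots
        if not slot["active"] and retained.get(slot["slot_name"], 0) > threshold_bytes
    )


if __name__ == "__main__":
    # Hypothetical sample rows; real values would come from pg_replication_slots.
    sample = [
        {"slot_name": "patroni_01", "active": False, "restart_lsn": "1/0"},
        {"slot_name": "patroni_12", "active": True, "restart_lsn": "2/0"},
    ]
    # Flag inactive slots holding back more than 1 GiB of WAL.
    print(stale_slots("3/0", sample, 1 << 30))
```

A check like this only reports candidates. Whether a slot can safely be dropped still depends on whether the replica behind it is expected to return.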