2019-09-03 Patroni failover

Summary

A brief summary of what happened. Try to make it as executive-friendly as possible.

Service(s) affected :
Team attribution :
Minutes downtime or degradation : None

Timeline

2019-09-03

  • 23:44 UTC - First alert: Pingdom check check:https://gitlab.com/gitlab-org/gitlab-ce/issues is down https://gitlab.slack.com/archives/C101F3796/p1567554260290400
  • 23:44 UTC - Alert: PostgreSQL_ServiceDown https://gitlab.slack.com/archives/C101F3796/p1567554269290500

2019-09-04

  • 00:02 UTC - Most things seem to be recovering, except for replication delay https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&fullscreen&panelId=968&from=1567553431327&to=1567555353841
  • 00:40 UTC - pg_up == 1 on most hosts now, except patroni-01 (to be expected) and patroni-10 (old master)
  • 00:40 UTC - OnGres joins the call (Sergio)
  • 01:50 UTC - Build of patroni-12 begins
  • 01:51 UTC - Running ANALYZE on the new master as per https://gitlab.com/gitlab-com/runbooks/blob/master/howto/patroni-management.md#update-statistics-views-on-the-new-master
  • 03:05 UTC - An earlier attempt to bring patroni-12 into sync was interrupted by the deletion of its replication slot. The sync has been started again.
  • Unknown timings - patroni-13 and patroni-14 brought up and added to the cluster as replicas.

2019-09-05

  • 01:15 UTC - patroni-01 replication slot removed, to prevent WAL files from accumulating on patroni-11 (current leader).
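
The slot removal above matters because a replication slot whose consumer is offline pins WAL on the leader indefinitely. As a rough illustration (not the tooling used during the incident), the sketch below estimates how many bytes of WAL a slot is holding back by subtracting the slot's restart_lsn, as reported in pg_replication_slots, from the leader's current WAL position. The slot names and LSN values are hypothetical, chosen only to show the arithmetic.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '2B/10000000' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and the
    low 32 bits of a 64-bit position in the WAL stream.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the leader must keep for a slot stuck at restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


# Hypothetical values for illustration only; not taken from the incident.
current = "2B/10000000"
slots = {
    "patroni_01": "2A/10000000",  # consumer offline, slot far behind
    "patroni_12": "2B/0FFFFF00",  # consumer nearly caught up
}

for name, restart in slots.items():
    print(name, retained_wal_bytes(current, restart))
```

With these made-up numbers the stale slot retains 4 GiB of WAL while the healthy one retains 256 bytes, which is the kind of gap that would justify dropping the slot of a replica that is not coming back soon.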
Edited Aug 03, 2020 by 🤖 GitLab Bot 🤖