GitLab.com elevated error rates and brief site outage on June 2, 2019

Summary

Working doc, started while this issue was not accessible: https://docs.google.com/document/d/1RM3QnuJ4FPH10J3UrJS0T26d-mn_Dd11-jmZweHQVV8/edit#heading=h.cvpl7op2up1d

Service(s) affected: All of GitLab.com

Team attribution:

Minutes downtime or degradation: 20 min down, 190 min degraded.

https://dashboards.gitlab.net/d/ZUei7TkWz/platform-metrics?orgId=1&fullscreen&panelId=3&from=1559498453907&to=1559516462767
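As a rough cross-check of the figures above: the `from` and `to` parameters in the dashboard URL are epoch milliseconds, and the hard-down window can be estimated from the timeline below. The Python sketch that follows decodes the dashboard window and measures the gap between the first Pingdom errors (19:07 UTC) and the site being reported operational (19:26 UTC). Treating those two timeline entries as the outage boundaries is an assumption, not something the report states, and the helper name `ms_to_utc` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds (as used in the Grafana URL) to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the dashboard URL's `from` and `to` query parameters.
window_start = ms_to_utc(1559498453907)
window_end = ms_to_utc(1559516462767)
window_minutes = (window_end - window_start) / timedelta(minutes=1)

# Assumed outage boundaries, taken from the timeline entries.
pingdom_errors = datetime(2019, 6, 2, 19, 7, tzinfo=timezone.utc)
operational = datetime(2019, 6, 2, 19, 26, tzinfo=timezone.utc)
down_minutes = (operational - pingdom_errors) / timedelta(minutes=1)

if __name__ == "__main__":
    print("dashboard window:", window_start.isoformat(), "to", window_end.isoformat())
    print("dashboard window length (min):", round(window_minutes))
    print("estimated hard-down time (min):", down_minutes)
```

The 19-minute gap is consistent with the reported figure of roughly 20 minutes down. The 190 minutes of degradation cannot be reproduced from the timeline entries alone.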

Timeline

2019-06-02

18:50 UTC - Patroni failed over from patroni-04 to patroni-06
19:05 UTC - Most of our Grafana dashboards are working only intermittently because of Thanos issues: #849 (closed)
19:07 UTC - Pingdom returning errors
19:12 UTC - Diagnosis of Postgres failover
19:21 UTC - Services HUP'd
19:26 UTC - GitLab.com operational. Pingdom reporting services as up.
19:41 UTC - Watching https://status.cloud.google.com/incident/compute/19003
19:59 UTC - Also watching https://status.cloud.google.com/incident/cloud-networking/19009
20:39 UTC - Continuing to monitor Google incidents
21:20 UTC - Possibly another failover, from patroni-04 to patroni-01

2019-06-03

09:40 UTC - Restoring tuple statistics by running a cluster-wide ANALYZE; see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841#note_128321668 (completed 10:10 UTC)
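
The linked note describes what was actually run. Purely as an illustration, one generic way to run ANALYZE against every database in a PostgreSQL cluster is the `vacuumdb` client with `--all --analyze-only`. The Python sketch below only builds that command and, if asked, runs it. The host name and function names are placeholders assumed for the example, not details from this incident.

```python
import subprocess
from typing import List


def build_analyze_command(host: str) -> List[str]:
    """Build a vacuumdb invocation that only refreshes planner statistics.

    --all          process every database in the cluster
    --analyze-only run ANALYZE without vacuuming
    The host value is a hypothetical placeholder supplied by the caller.
    """
    return ["vacuumdb", "--all", "--analyze-only", "--host", host]


def run_cluster_analyze(host: str) -> None:
    """Run the command; raises CalledProcessError if vacuumdb exits non-zero."""
    subprocess.run(build_analyze_command(host), check=True)


if __name__ == "__main__":
    # Print the command rather than executing it, so the sketch is safe to run.
    print(" ".join(build_analyze_command("db-primary.example.internal")))
```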

Edited Jul 08, 2019 by Dave Smith