GitLab.com elevated error rates and site down for a short period on June 2, 2019
Please note: if the incident relates to sensitive data or is security related, consider labeling this issue with security and marking it confidential.
Summary
Working doc started when issue was not accessible: https://docs.google.com/document/d/1RM3QnuJ4FPH10J3UrJS0T26d-mn_Dd11-jmZweHQVV8/edit#heading=h.cvpl7op2up1d
Service(s) affected : All GitLab.com
Team attribution :
Minutes downtime or degradation : 20 min down, 190 min degraded.
Timeline
2019-06-02
- 18:50 UTC - Patroni failed over from -04 to -06
- 19:05 UTC - Most of our Grafana dashboards are working inconsistently because of Thanos issues: #849 (closed)
- 19:07 UTC - Pingdom returning errors
- 19:12 UTC - Diagnosis of Postgres failover
- 19:21 UTC - Services hup'd (sent SIGHUP)
- 19:26 UTC - GitLab.com operational. Pingdom reporting services as up.
- 19:41 UTC - Watching https://status.cloud.google.com/incident/compute/19003
- 19:59 UTC - Also watching https://status.cloud.google.com/incident/cloud-networking/19009
- 20:39 UTC - Continuing to monitor Google incidents
- 21:20 UTC - Possibly another failover, from patroni-04 to -01
2019-06-03
- 09:40 UTC - Restoring tuple statistics by running cluster-wide ANALYZE, see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841#note_128321668 (done 10:10 UTC)
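For reference, a minimal sketch of how a cluster-wide ANALYZE could be scripted after a failover. This is not the procedure the team ran (that is in the linked note). It assumes the standard PostgreSQL `vacuumdb` client is available and uses its `--all`, `--analyze-only`, `--jobs`, `--host` and `--username` options. The host name, the default job count and the function names are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_analyze_command(host: str, jobs: int = 4,
                          user: Optional[str] = None) -> List[str]:
    """Build a vacuumdb invocation that runs ANALYZE (no VACUUM) on every database."""
    if jobs < 1:
        raise ValueError("jobs must be >= 1")
    cmd = [
        "vacuumdb",
        "--all",            # every database in the cluster
        "--analyze-only",   # refresh statistics only, skip vacuuming
        f"--jobs={jobs}",   # parallel connections; value here is an assumption
        f"--host={host}",
    ]
    if user:
        cmd.append(f"--username={user}")
    return cmd


def run_analyze(host: str, jobs: int = 4, user: Optional[str] = None,
                dry_run: bool = True) -> int:
    """Print the command (dry run) or execute it against the given host."""
    cmd = build_analyze_command(host, jobs, user)
    if dry_run:
        print(" ".join(cmd))
        return 0
    # check=True raises CalledProcessError if vacuumdb exits non-zero.
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    # Hypothetical host name; point this at the current Patroni primary.
    run_analyze("patroni-primary.example.internal")
```

The dry-run default only prints the command, so it can be reviewed before anything touches the new primary.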
Edited by Dave Smith