Incident review for 2022-11-30: GitLab.com site-wide outage
Incident Review
Incident: #8097 (closed)
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All traffic to GitLab.com.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Users would have received 5XX errors during the outage.
- How many customers were affected?
  - ~100% of traffic.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Our CDN reports ~13M requests that resulted in a `503` during the outage.
- What were the root causes?
  - An engineer triggered a routine task using our tools in a local development environment. Due to an unfortunate series of events and unexpected tool behavior, the task applied a change to production that took down a core component (Consul) of our Postgres HA solution (Patroni). A sketch of a possible safeguard follows below.
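One class of safeguard against this failure mode is a pre-flight check that refuses to run destructive operations while the local tooling is pointed at a production cluster. The sketch below is a minimal, hypothetical illustration in Python, assuming the tooling shells out to `kubectl`; the marker strings, prompt, and function names are assumptions for illustration, not a description of our actual tooling.

```python
import subprocess
import sys

# Hypothetical markers: a real guard would list the exact production context names.
PRODUCTION_MARKERS = ("prod", "production")

def current_context() -> str:
    """Return the kubectl context the local environment currently targets."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def guard_destructive_operation() -> None:
    """Abort unless the operator explicitly confirms a production target."""
    ctx = current_context()
    if any(marker in ctx for marker in PRODUCTION_MARKERS):
        answer = input(f"Context '{ctx}' looks like production. Type the full context name to proceed: ")
        if answer != ctx:
            sys.exit(f"Refusing to run a destructive operation against '{ctx}'.")

if __name__ == "__main__":
    guard_destructive_operation()
    # ...the deploy/rollback operation would run here...
```

A guard like this turns "the context silently changed to production" into an explicit, interactive confirmation step.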
Incident Response Analysis
- How was the incident detected?
  - We were alerted immediately by our monitoring & alerting infrastructure.
- How could detection time be improved?
  - Detection was already near-immediate: we were aware of a major incident as soon as the Consul cluster went down and impacted our Patroni clusters.
- How was the root cause diagnosed?
  - As one of the alerts indicated a Patroni failover, Patroni was the first component to be investigated. It wasn’t responding, so we looked at Consul and noticed the Consul servers were missing. We quickly realized that the Consul release in our Kubernetes cluster was missing and proceeded to reinstall it.
- How could time to diagnosis be improved?
  - In this incident, our investigation quickly led us to the Patroni clusters, which were unhealthy. Given their dependence on Consul, we followed the trail and found that the Consul server cluster was missing.
- How did we reach the point where we knew how to mitigate the impact?
  - Once it was clear that the entire Consul release had been deleted (all Consul-related pods, services, etc. were missing), we re-applied the release to production. The Consul cluster quickly came up and the database clusters recovered. A sketch of this check-and-reinstall step follows this list.
- How could time to mitigation be improved?
  - The tooling we’re using did not make it straightforward to redeploy the Consul release, and manual intervention was required. Improving our Kubernetes deployment story would likely help here (see PoC: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16875).
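For illustration, the kind of check-and-reinstall step involved might look like the sketch below, assuming the Consul release is managed by Helm; the context, namespace, release, and chart names are hypothetical placeholders, not our actual values.

```python
import subprocess

# Hypothetical values for illustration only.
CONTEXT = "production-cluster"
NAMESPACE = "consul"
RELEASE = "consul"
CHART = "hashicorp/consul"

def release_installed() -> bool:
    """True if Helm reports the release as present in the namespace."""
    result = subprocess.run(
        ["helm", "status", RELEASE,
         "--namespace", NAMESPACE, "--kube-context", CONTEXT],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def reinstall_release() -> None:
    """Re-apply the release; `upgrade --install` installs it if absent."""
    subprocess.run(
        ["helm", "upgrade", "--install", RELEASE, CHART,
         "--namespace", NAMESPACE, "--kube-context", CONTEXT],
        check=True,
    )

if __name__ == "__main__":
    if not release_installed():
        reinstall_release()
```

Automating this path is the kind of improvement the PoC linked above could help with.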
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - We are looking to improve our Kubernetes deployment story, which would greatly reduce the risk of running potentially harmful deployment operations against a production environment from our local environments - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16875
  - We are also working on extending our use of Teleport by integrating it with our Kubernetes clusters; this could be configured to assign SREs a read-only role by default, reducing the risk of accidentally making changes to a production environment (a sketch of such a role follows this section) - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16221
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No. This incident was triggered by an engineer performing routine local development work using a well-known open source tool that behaved in unexpected ways when the context changed from local development to production during a local deploy/rollback operation. We are urgently working on corrective actions to avoid recurrence in the future.
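To illustrate the read-only-by-default idea from the Teleport backlog item above, the sketch below creates a cluster-wide read-only role using the official Kubernetes Python client. The role name is hypothetical, and a real deployment would scope the rules more narrowly and map the role to SREs through Teleport's own configuration.

```python
from kubernetes import client, config

# Load the operator's local kubeconfig (in-cluster config would differ).
config.load_kube_config()

# Hypothetical role name; the verbs are the standard Kubernetes read-only set.
read_only_role = client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="sre-read-only"),
    rules=[
        client.V1PolicyRule(
            api_groups=["*"],
            resources=["*"],
            verbs=["get", "list", "watch"],
        )
    ],
)

client.RbacAuthorizationV1Api().create_cluster_role(body=read_only_role)
```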
What went well?
- Bringing the Consul cluster back up did not, in itself, take very long. It involved some manual work, but it was carried out relatively quickly.
- We were alerted immediately that something major had happened and we identified the root cause within minutes.