Incident Review for Site-wide Outage for GitLab.com - Stale Terraform Pipeline #15997
The DRI for the incident review is the issue assignee.
If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
Fill out relevant sections below, or link to the meeting review notes that cover these topics.
Who was impacted by this incident? (i.e. external customers, internal customers)
- All users of gitlab.com, including external and internal customers.
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- GitLab.com was unavailable on 2023-07-07 from 16:25 UTC to 18:42 UTC. During this time the web and API interfaces returned 503 errors. Customers were still able to perform git actions via the command line.
- For customers that did not have DNS records cached, Container Registry was unavailable on 2023-07-07 from 16:25 UTC to 19:36 UTC.
- A small number of git pushes made on 2023-07-07 between 15:55 UTC and 16:17 UTC will not be available on GitLab.com until the changes are pushed again from a local copy.
- We have restored data to known recovery points, and a small subset of customer projects requires a refresh using their local copy.
- The impacted project owners have been notified and were advised to re-push their changes.
How many customers were affected?
- All customers.
- Ongoing investigation: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24086
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- All customers.
What were the root causes?
An outdated production configuration was applied to our production environment, which caused several GitLab.com production services to be removed and replaced.
- The root cause was an out-of-sync infrastructure configuration plan (Terraform) executed against our production environment.
- This infrastructure configuration plan was prepared 3 weeks earlier, ahead of our production database upgrade.
- Environment drift accumulated during those 3 weeks, causing the planned configuration and the production environment to fall out of sync.
- Executing the out-of-sync plan caused an unintended removal of production services which resulted in the outage.
- We typically execute configuration plans shortly after they are prepared. However, the execution of this 3-week-old configuration plan exposed a gap in our process.
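One way to close that process gap is to treat saved plan artifacts as perishable. Below is a minimal sketch of such a guard; the 24-hour policy and the `plan_is_fresh` helper are hypothetical illustrations, not part of our actual tooling:

```python
import os
import tempfile
import time

# Hypothetical policy: refuse to apply a saved Terraform plan artifact
# that is older than 24 hours.
MAX_PLAN_AGE_SECONDS = 24 * 60 * 60


def plan_is_fresh(plan_path, max_age=MAX_PLAN_AGE_SECONDS):
    """Return True if the plan artifact is recent enough to apply."""
    age_seconds = time.time() - os.path.getmtime(plan_path)
    return age_seconds <= max_age


# Demo: a freshly written plan file passes the check; one backdated
# three weeks (as in this incident) fails it.
with tempfile.NamedTemporaryFile(suffix=".tfplan", delete=False) as f:
    plan_path = f.name

fresh_ok = plan_is_fresh(plan_path)

three_weeks_ago = time.time() - 21 * 24 * 60 * 60
os.utime(plan_path, (three_weeks_ago, three_weeks_ago))
stale_ok = plan_is_fresh(plan_path)

os.remove(plan_path)
```

In a CI pipeline, a check like this would run immediately before `terraform apply`, forcing a fresh `terraform plan` whenever the artifact has expired.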
Incident Response Analysis
How was the incident detected?
- 16:10 UTC - A CI job begins applying an outdated Terraform configuration
- 16:15 UTC - EOC received Blackbox PagerDuty notifications about https://cdn-artifacts.gitlab-static.net URLs failing
- 16:15 UTC - DBRE posted in the #production channel that they were applying a Terraform plan via CI
- 16:18 UTC - EOC declared an S2 incident
- 16:20 UTC - Based on Slack reports and personal observation of 5xx errors on GitLab.com, the EOC attempted to upgrade the incident to an S1.
- It was five minutes from when the job began applying destructive changes until the EOC was notified of an initial problem, and ten minutes until it was clear that this was an S1 site-wide outage.
How could detection time be improved?
How was the root cause diagnosed?
- 18:25 UTC - Checked Cloudflare status page
- 18:25 UTC - DBRE brings the call's attention to a running Terraform apply job
- 18:28 UTC - A first look at the MR attached to the pipeline suggested it was harmless
- 18:30 UTC - Examining the running Terraform apply job revealed several resources being destroyed. Referencing the plan for the pipeline showed 617 resources to be destroyed.
- 18:31 UTC - The job was stopped to try to prevent further destruction.
- 18:34 UTC - A local plan against the production environment was run by the EOC to see what cloud resources were missing.
- 18:37 UTC - At this point, it appeared likely that the applied plan had caused the outage due to destroyed resources.
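The 617-resource destroy count was only noticed by reading the running job's output. A pipeline could surface this before applying by inspecting the machine-readable plan produced by `terraform show -json plan.tfplan`. Below is a minimal sketch; the sample JSON and the `count_destroys` helper are illustrative assumptions, not real plan output or existing tooling:

```python
import json

# Illustrative stand-in for `terraform show -json plan.tfplan` output:
# each "resource_changes" entry carries a "change.actions" list such as
# ["delete"], ["create"], or ["delete", "create"] for a replacement.
sample_plan_json = json.dumps({
    "resource_changes": [
        {"address": "google_compute_disk.example_a",
         "change": {"actions": ["delete"]}},
        {"address": "google_compute_instance.example_b",
         "change": {"actions": ["delete", "create"]}},
        {"address": "google_dns_record_set.example_c",
         "change": {"actions": ["no-op"]}},
    ]
})


def count_destroys(plan_json):
    """Count resources the plan would destroy, including destroy-and-recreate."""
    plan = json.loads(plan_json)
    return sum(
        1
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    )


destroy_count = count_destroys(sample_plan_json)
```

Failing the pipeline when the destroy count exceeds a small threshold, and requiring an explicit human override, would have blocked this apply long before any resources were removed.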
How could time to diagnosis be improved?
- We had to fall back to Google Docs to manage this incident since GitLab.com was unavailable; a GitLab issue was eventually created once GitLab.com was back online (#15997, closed). Keeping both the doc and the issue meant copying and pasting data back and forth, which was inefficient while trying to manage the incident. One way to improve this would be to use our ops instance for incident tracking rather than Google Docs; however, that would introduce other problems and has been discarded for now. Alternative solutions are being discussed, and a follow-up issue has been created to continue exploring them: Improve Incident Management process when gitlab... (gitlab-com/www-gitlab-com#34382 - moved)
How did we reach the point where we knew how to mitigate the impact?
- We spent some time trying to "fix" Terraform, or to get a better handle on how a restore might work without having to slowly apply each difference one at a time.
- While that was happening, work went into assessing which specific disks and other systems were missing.
- An attempt to remove the dependency on the Redis cache cluster was put in progress to see if that would bring the web fleet back to operating status.
- The call was broken into two Zoom chats. The main incident room focused on restoring services without using Terraform; the other focused on restoring resources via Terraform.
- We considered the incident mitigated once the resources were restored, the Terraform configuration was in a clean state, and affected customers had been notified.
How could time to mitigation be improved?
- Identifying a list of affected customers was delayed because our tooling (Rails console and Teleport) was offline. This tooling depended on first getting our Terraform configuration into a clean state.
- Drafting messaging for affected customers required significant cross-functional effort and approvals (engineering, product, and support to assess impact; corporate communications to draft the message; marketing ops to send it; legal and customer success to sign off). This can be improved by having pre-approved messaging templates for these types of outages, allowing us to move more quickly. Follow-up issue: https://gitlab.com/gitlab-com/www-gitlab-com/-/issues/34387+
Post Incident Analysis
Did we have other events in the past with the same root cause?
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
What went well?
- A lot of people came together to help, even though it was a Friday, and into Saturday!
- Our response processes were put to the test, with immediate positive feedback from everyone involved; for example, the breakdown of Zoom calls, threads, and documents to tackle different parts of the incident was excellent.