2020-05-07: about.gitlab.com went down briefly

Incident: #2087 (closed)

Summary

On May 7th, 2020, between 04:26 and 04:58 UTC, about.gitlab.com went down and showed:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

NoSuchKey
The specified key does not exist.
  1. Service(s) affected : about.gitlab.com
  2. Team attribution : TBD
  3. Minutes downtime or degradation : 32 mins

Metrics

N/A

Customer Impact

  1. **Who was impacted by this incident?**
    All customers who tried to browse about.gitlab.com.

  2. **What was the customer experience during the incident?**
    They received an XML error: "NoSuchKey"

  3. **How many customers were affected?**

Incident Response Analysis

  1. **How was the event detected?**
  2. **How could detection time be improved?**
  • I don't think it needs improvement. I believe we detected the issue in a reasonable amount of time.
  3. **How did we reach the point where we knew how to mitigate the impact?**
  • A team member pointed to a pipeline deployment that was in progress at the time.
  4. **How could time to mitigation be improved?**
  • Mitigation was just a matter of waiting until the deployment finished.

Post Incident Analysis

  1. **How was the root cause diagnosed?**
  • The proximal root cause points to a recent deployment, https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/541883682, whose logs show files being removed, including the main index.html. However, we still need input from the team member(s) involved in the deployment to confirm the root cause.
  2. **How could time to diagnosis be improved?**
  3. **Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?**
    N/A

  4. **Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?**
    Yes and yes.

Timeline

2020-05-07

5 Whys

  • Why did we get paged for about.gitlab.com being down? => Because about.gitlab.com was not responding with a 200 and was instead returning the XML error "NoSuchKey".
  • Why was about.gitlab.com returning the XML error? => Because index.html was gone.
  • Why was index.html gone? => It is not 100% confirmed yet, but it looks like the deployment https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/541883682 removed index.html among other files.
  • Why did the deployment remove files? => The artifacts from two of the prior build_master jobs were not properly downloaded by the deploy job. See details here: #2088 (comment 337974367)
  • Why didn't the deploy job properly download the artifacts? => This appears to be related to this issue: gitlab-org/gitlab#212349 (closed)
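
The first "why" above describes the failure signature: a non-200 response whose body is an XML error carrying "NoSuchKey". A minimal sketch of how a health check might recognize that signature is shown below. It is not the check that actually paged us; the function name is hypothetical, and the `<Error><Code>…</Code></Error>` layout is an assumption based on the error text quoted in the Summary.

```python
import xml.etree.ElementTree as ET


def is_missing_key_error(status: int, body: str) -> bool:
    """Return True when a response looks like a bucket "NoSuchKey" error.

    Hypothetical helper: assumes the bucket answers with an XML document
    that contains a <Code>NoSuchKey</Code> element somewhere in the tree.
    """
    if status == 200:
        return False  # the page was served normally
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False  # not XML, so some other kind of failure
    for element in root.iter():
        # Drop an XML namespace prefix such as "{uri}Code" if one is present.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "Code" and (element.text or "").strip() == "NoSuchKey":
            return True
    return False


# Assumed shape of the error body seen during the incident.
SAMPLE_ERROR = (
    "<Error><Code>NoSuchKey</Code>"
    "<Message>The specified key does not exist.</Message></Error>"
)
```

Distinguishing this case from a generic outage would tell the responder immediately that the site's objects are missing from the bucket, rather than that the serving layer is down.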

Lessons Learned

  1. A bad deployment can wipe out essential files that the website needs in order to work.
  2. Domain knowledge about a bucket-sourced website such as about.gitlab.com would have been helpful.
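
One way to act on the first lesson is a guard that runs before the deploy step and refuses to continue when essential files are absent from the build output. The sketch below is an illustration only, not a description of the www-gitlab-com pipeline: the function name, the `REQUIRED_FILES` list, and the idea of checking a local build directory are all assumptions.

```python
from pathlib import Path

# Hypothetical list of files the site cannot work without.
REQUIRED_FILES = ("index.html",)


def missing_required_files(build_dir, required=REQUIRED_FILES):
    """Return the required files that are absent or empty in build_dir."""
    root = Path(build_dir)
    missing = []
    for name in required:
        candidate = root / name
        # Treat a zero-byte file as missing: it would still break the page.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing
```

A deploy script could call `missing_required_files("public")` (the directory name is also an assumption) and exit with a non-zero status when the returned list is not empty, so that an incomplete artifact download fails the job instead of removing files from the live site.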

Corrective Actions

Guidelines

Edited by Amarbayar Amarsanaa