2020-05-07: about.gitlab.com went down briefly
Incident: #2087 (closed)
Summary
On May 7th, 2020, between 04:26 UTC and 04:58 UTC, about.gitlab.com went down and showed:
> This XML file does not appear to have any style information associated with it. The document tree is shown below.
> NoSuchKey
> The specified key does not exist.
- Service(s) affected : about.gitlab.com
- Team attribution : TBD
- Minutes downtime or degradation : 32 mins
Metrics
N/A
Customer Impact
- **Who was impacted by this incident?**
  - All customers who tried to browse to the about.gitlab.com page
- **What was the customer experience during the incident?**
  - They were getting an XML error: "NoSuchKey"
- **How many customers were affected?**
  - According to https://log.gprd.gitlab.net/goto/1e16992114f9fe6c4af41a80ea770b0b, there were about 19K requests that came in for about.gitlab.com.
  - According to https://log.gprd.gitlab.net/goto/13ff14b8d7f9b0c5743b4346bd571d4c, there were 2,421 unique remote IP addresses during the time frame.
Incident Response Analysis
- **How was the event detected?**
- User reported in Slack via: https://gitlab.slack.com/archives/C101F3796/p1588825616259400
- 2 minutes later alert came in: https://gitlab.slack.com/archives/C101F3796/p1588825692259800
- **How could detection time be improved?**
  - I don't think detection needs improvement; we detected the issue in a good amount of time.
- **How did we reach the point where we knew how to mitigate the impact?**
  - A team member pointed to an ongoing pipeline deployment.
- **How could time to mitigation be improved?**
  - It was just a matter of waiting until the deployment finished.
Post Incident Analysis
- **How was the root cause diagnosed?**
  - Proximal root cause points to a recent deployment, https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/541883682, whose logs show files being removed, including the main `index.html`. However, we need to bring in the team member(s) who may have been involved with the deployment to get more data points on the root cause.
- **How could time to diagnosis be improved?**
  - The EOC wasn't aware right away that `index.html` was missing, even though general knowledge of a website hosted from a cloud bucket would most likely have pointed there. Improving the runbook at https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/about-gitlab-com.md could help improve MTTD.
  - Created https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10091 for the above corrective action item.
- **Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?**
  - N/A
- **Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?**
  - Yes and yes.
Timeline
2020-05-07
- 04:26 - Team member reports an issue with Customers Portal FAQ
- 04:28 - PD alert: Pingdom check check:https://gitlab.com/ is down
- 04:29 - PD alert: Pingdom check check:https://about.gitlab.com/ is down
- 04:30 - Incident declared from Slack
- 04:31 - PD alert: Pingdom check check:http://about.gitlab.com/ is down
- 04:33 - PD alert: Firing 1 - www.gitlab.com is down for 2 minutes
- 04:33 - PD alert: Pingdom check check:http://gitlab.org/ is down
- 04:41 - PD alert: Firing 1 - Chef client failures have reached critical levels
- 04:41 - EOC finds out `index.html` is missing in the GCP bucket
- 04:52 - Team member helps find an ongoing deployment which should hopefully resolve the issue, as well as a recent deployment which removed the `index.html`
- 04:58 - PD alerts got resolved
5 Whys
- Why did we get paged for about.gitlab.com being down? => Because the about.gitlab.com page wasn't responding with a 200 and was instead returning the XML error "NoSuchKey".
- Why was the about.gitlab.com returning the XML error? => Because the index.html was gone.
- Why was the index.html gone? => It is not yet 100% confirmed, but it looks like this deployment, https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/541883682, removed the index.html among other files.
- Why did the deployment remove files? => The artifacts from two of the prior `build_master` jobs were not properly downloaded by the `deploy` job. See details here: #2088 (comment 337974367)
- Why didn't the `deploy` job properly download the artifacts? => Appears to be related to this issue: gitlab-org/gitlab#212349 (closed)
Lessons Learned
- A bad deployment can wipe out essential files that make the website work.
- Domain knowledge about a bucket-sourced website such as about.gitlab.com would have been helpful.
Corrective Actions
- Update runbook: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10091
- Add logic to CI config to guard against site outages due to artifact upload errors: gitlab-com/www-gitlab-com#7661 (closed)
Edited by Amarbayar Amarsanaa