2020-05-07: about.gitlab.com went down briefly
Incident: #2087 (closed)
Summary
On May 7th, 2020, between 04:26 UTC and 04:58 UTC, about.gitlab.com went down and showed:
> This XML file does not appear to have any style information associated with it. The document tree is shown below.
> NoSuchKey
> The specified key does not exist.
- Service(s) affected : about.gitlab.com
- Team attribution : TBD
- Minutes downtime or degradation : 32 mins
Metrics
N/A
Customer Impact
- **Who was impacted by this incident?**
  - All customers who tried to browse to the about.gitlab.com page
- **What was the customer experience during the incident?**
  - They were getting an XML error: "NoSuchKey"
- **How many customers were affected?**
  - According to https://log.gprd.gitlab.net/goto/1e16992114f9fe6c4af41a80ea770b0b, there were about 19K requests that came in for about.gitlab.com.
  - According to https://log.gprd.gitlab.net/goto/13ff14b8d7f9b0c5743b4346bd571d4c, there were 2,421 unique remote IP addresses during the time frame.
Incident Response Analysis
- **How was the event detected?**
- User reported in Slack via: https://gitlab.slack.com/archives/C101F3796/p1588825616259400
- 2 minutes later alert came in: https://gitlab.slack.com/archives/C101F3796/p1588825692259800
- **How could detection time be improved?**
  - I don't think detection needs improvement; we detected the issue in a good amount of time.
- **How did we reach the point where we knew how to mitigate the impact?**
  - A team member pointed to an ongoing pipeline deployment.
- **How could time to mitigation be improved?**
  - It was just a matter of waiting until the deployment finished.
Post Incident Analysis
- **How was the root cause diagnosed?**
  - Proximal root cause points to a recent deployment, https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/541883682, whose logs show files being removed, including the main `index.html`. However, we need to bring in the team member(s) who may have been involved with the deployment to get more data points on the root cause.
- **How could time to diagnosis be improved?**
  - The EOC wasn't aware right away that `index.html` was missing, even though general knowledge of a website hosted from a cloud bucket would most likely have pointed there. Improving the runbook at https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/about-gitlab-com.md could help improve MTTD.
  - Created https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10091 for the above corrective action item.
- **Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?**
  - N/A
- **Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?**
  - Yes and yes.
Timeline
2020-05-07
- 04:26 - Team member reports an issue with Customers Portal FAQ
- 04:28 - PD alert: Pingdom check check:https://gitlab.com/ is down
- 04:29 - PD alert: Pingdom check check:https://about.gitlab.com/ is down
- 04:30 - Incident declared from Slack
- 04:31 - PD alert: Pingdom check check:http://about.gitlab.com/ is down
- 04:33 - PD alert: Firing 1 - www.gitlab.com is down for 2 minutes
- 04:33 - PD alert: Pingdom check check:http://gitlab.org/ is down
- 04:41 - PD alert: Firing 1 - Chef client failures have reached critical levels
- 04:41 - EOC finds out `index.html` is missing in the GCP bucket
- 04:52 - Team member helps find an ongoing deployment which should hopefully resolve the issue, as well as a recent deployment which removed the `index.html`
- 04:58 - PD alerts got resolved
5 Whys
- Why did we get paged for about.gitlab.com being down? => Because the about.gitlab.com page wasn't responding with a 200 and was instead returning the XML error "NoSuchKey".
- Why was the about.gitlab.com returning the XML error? => Because the index.html was gone.
- Why was the index.html gone? => It is not yet 100% confirmed, but it looks like this deployment, https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/541883682, removed the index.html among other files.
- Why did the deployment remove files? => The artifacts from two of the prior `build_master` jobs were not properly downloaded by the `deploy` job. See details here: #2088 (comment 337974367)
- Why didn't the `deploy` job properly download the artifacts? => Appears to be related to this issue: gitlab-org/gitlab#212349 (closed)
Lessons Learned
- A bad deployment can wipe out essential files that make the website work.
- Domain knowledge about a bucket-sourced website such as about.gitlab.com would have been helpful.
Corrective Actions
- Update runbook: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10091
- Add logic to CI config to guard against site outages due to artifact upload errors: gitlab-com/www-gitlab-com#7661 (closed)
Edited by Amarbayar Amarsanaa