RCA :: 2020-06-02 - Handbook outage
Summary
On 2020-06-02, !51499 (merged), was deployed which progresses our move to a monorepo structure in the www-gitlab-com repo. This deploy contained a bug in the deployment job that caused the Handbook pages not to be deployed to production.
Service(s) affected : https://about.gitlab.com/handbook/
Team attribution : Static Site Editor team responsible for Handbook escalation as per https://about.gitlab.com/handbook/about/on-call/
Minutes downtime or degradation : 78min
Impact & Metrics
| Question | Answer | 
|---|---|
| What was the impact | The Handbook was missing from the website and all URLs to the /handbook/ namespace returned a 404 response | 
| Who was impacted | All visitors to the Handbook, internal and external to GitLab | 
| How did this impact customers | It did not impact customers in as far as using GitLab the product on gitlab.com relates | 
| How many attempts made to access | n/a | 
| How many customers affected | n/a | 
| How many customers tried to access | n/a | 
Detection & Response
| Question | Answer | 
|---|---|
| When was the incident detected? | 23:21 UTC | 
| How was the incident detected? | @pcalder Reported it to the #handbook-escalation channel in Slack | 
| Did alarming work as expected? | There is no monitoring in place to check if the Handbook pages are accessible | 
| How long did it take from the start of the incident to its detection? | 46min (22:35 UTC -> 23:21 UTC | 
| How long did it take from detection to remediation? | 32min (23:21 UTC -> 23:53 UTC) | 
| What steps were taken to remediate? | The cause of the bug was identified and a fix pushed to master | 
| Were there any issues with the response? | The Handbook On-call process was used to escalate the incident at 23:21 UTC and the escalation was acknowledged 23:37 UTC by @cwoolley-gitlab. This is a 16min response time, which is within the SLO defined of 60min. | 
Timeline
2020-06-02
- 22:35 UTC: !51499 (merged) got deployed to product
 - 23:21 UTC: @pcalder raised handbook-escalation in Slack
 - 23:37 UTC: @cwoolley-gitlab acknowledged escalation
 - 23:46 UTC: @cwoolley-gitlab pushes fix to master:
 - 23:52 UTC: Deploy to production finishes resolving the outage
 
Root Cause Analysis
A deploy job caused all the GitLab Handbook pages to be missing from the production site.
- Why? - The deploy job (in .gitlab-ci.yml) did not declare that it needed the 
build-handbookjob resulting in the deploy omitting the Handbook pages - Why? - The bug was not picked up in testing of the MR that introduced the first part of the monorepo restructure.
 - Why? - Testing did not cover checking the actual deploy to production job
 - Why? - Testing the deploy job would have required setting up a dedicated test bucket and credentials on GCP
 - Why? - We don't use a staging environment for about.gitlab.com like one does in the traditional sense
 - Why? - Since changes to the codebase is mostly content related we want to ship as fast as possible to production so we don't go through a staging verification process before master is deployed to production
 
What went well
- Handbook On-call process was triggered and response was provided within the First-response SLO
 - @cwoolley-gitlab was able to investigate and resolve the issue in a timely manner
 
What can be improved
- Better monitoring to alert to outages (404s) of pages on the http://about.gitlab.com website
 - Use of staging site and verification process for all major updates to the website
 
Corrective actions
- The Static Site Editor team should look into establishing a staging environment, properly configured and utilized for verification of any major changes before it is deployed to production: #7941 (closed)
 - The Static Site Editor team needs to implement better monitoring and alerts, other than just a broken master alert: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10493
 
Edited  by Jean du Plessis