RCA :: 2020-06-02 - Handbook outage

Summary

On 2020-06-02, !51499 (merged) was deployed, progressing our move to a monorepo structure in the www-gitlab-com repo. The deploy contained a bug in the deployment job that prevented the Handbook pages from being deployed to production.

Service(s) affected : https://about.gitlab.com/handbook/
Team attribution : Static Site Editor team responsible for Handbook escalation as per https://about.gitlab.com/handbook/about/on-call/
Minutes downtime or degradation : 78min

Impact & Metrics

| Question | Answer |
|---|---|
| What was the impact? | The Handbook was missing from the website, and all URLs under the /handbook/ namespace returned a 404 response |
| Who was impacted? | All visitors to the Handbook, internal and external to GitLab |
| How did this impact customers? | It did not affect customers' use of the GitLab product on gitlab.com |
| How many attempts made to access? | n/a |
| How many customers affected? | n/a |
| How many customers tried to access? | n/a |

Detection & Response

| Question | Answer |
|---|---|
| When was the incident detected? | 23:21 UTC |
| How was the incident detected? | @pcalder reported it to the #handbook-escalation channel in Slack |
| Did alarming work as expected? | There is no monitoring in place to check if the Handbook pages are accessible |
| How long did it take from the start of the incident to its detection? | 46min (22:35 UTC -> 23:21 UTC) |
| How long did it take from detection to remediation? | 32min (23:21 UTC -> 23:53 UTC) |
| What steps were taken to remediate? | The cause of the bug was identified and a fix pushed to master |
| Were there any issues with the response? | The Handbook On-call process was used to escalate the incident at 23:21 UTC, and the escalation was acknowledged at 23:37 UTC by @cwoolley-gitlab. This is a 16min response time, which is within the defined SLO of 60min. |
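
The durations reported above follow directly from the four timestamps. A minimal Python sketch of the arithmetic, assuming all times fall on 2020-06-02 UTC (the function and variable names are illustrative, not part of any GitLab tooling):

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on 2020-06-02 (the incident date) as UTC."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 6, 2, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the report above.
started = utc("22:35")
detected = utc("23:21")
acknowledged = utc("23:37")
remediated = utc("23:53")

time_to_detect = minutes_between(started, detected)
time_to_acknowledge = minutes_between(detected, acknowledged)
time_to_remediate = minutes_between(detected, remediated)
total_downtime = time_to_detect + time_to_remediate
```

Time to detect plus time to remediate gives the total downtime stated in the summary.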

Timeline

2020-06-02

Root Cause Analysis

A deploy job caused all the GitLab Handbook pages to be missing from the production site.

  1. Why? - The deploy job (in .gitlab-ci.yml) did not declare that it needed the build-handbook job, so the deploy omitted the Handbook pages
  2. Why? - The bug was not caught when testing the MR that introduced the first part of the monorepo restructure
  3. Why? - Testing did not cover the actual deploy-to-production job
  4. Why? - Testing the deploy job would have required setting up a dedicated test bucket and credentials on GCP
  5. Why? - We don't use a staging environment for about.gitlab.com in the traditional sense
  6. Why? - Since changes to the codebase are mostly content-related, we want to ship to production as fast as possible, so we don't go through a staging verification process before master is deployed to production
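
The first "Why?" describes a missing dependency declaration, which could in principle be caught by a pre-merge check. The sketch below is one possible way to do that, not the fix that was actually applied. It assumes the CI configuration has already been parsed into a Python dict. The job names `deploy` and `build-site` are hypothetical, and only `build-handbook` comes from this report. The real layout of .gitlab-ci.yml is not shown here.

```python
def missing_needs(ci_config: dict, job: str, required: set) -> set:
    """Return the required jobs that `job` does not list under `needs`."""
    declared = set()
    for entry in ci_config.get(job, {}).get("needs", []):
        # GitLab CI accepts `needs` entries either as plain job names
        # or as mappings with a `job` key.
        declared.add(entry["job"] if isinstance(entry, dict) else entry)
    return required - declared


# Hypothetical configs: `deploy` and `build-site` are illustrative names.
broken_config = {
    "deploy": {"stage": "deploy", "needs": ["build-site"]},
}
fixed_config = {
    "deploy": {
        "stage": "deploy",
        "needs": ["build-site", {"job": "build-handbook", "artifacts": True}],
    },
}

required_builds = {"build-site", "build-handbook"}
broken_result = missing_needs(broken_config, "deploy", required_builds)
fixed_result = missing_needs(fixed_config, "deploy", required_builds)
```

A check like this would fail the pipeline whenever the result is non-empty. It only covers the declaration itself, not whether the deployed site actually contains the pages.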

What went well

  • Handbook On-call process was triggered and response was provided within the First-response SLO
  • @cwoolley-gitlab was able to investigate and resolve the issue in a timely manner

What can be improved

  • Better monitoring to alert on outages (404s) of pages on the http://about.gitlab.com website
  • Use of a staging site and verification process for all major updates to the website
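
As an illustration of the first improvement, a scheduled check could request a handful of key URLs and report any that do not return 200. This is a minimal sketch using only the Python standard library. The function names, the `User-Agent` string, and the choice of URLs are assumptions, and it does not describe any monitoring GitLab actually runs.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    request = urllib.request.Request(url, headers={"User-Agent": "handbook-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code


def failing_urls(urls, fetch_status=http_status):
    """Return (url, status) pairs for every URL that did not answer 200."""
    results = [(url, fetch_status(url)) for url in urls]
    return [(url, status) for url, status in results if status != 200]
```

Calling `failing_urls(["https://about.gitlab.com/handbook/"])` from a scheduled job and alerting on a non-empty result would have flagged the 404s described in this report. Network-level failures (DNS errors, timeouts) are not handled here and would raise an exception.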

Corrective actions

Edited by Jean du Plessis