RCA for registry down 2020-02-04

Summary

  • Service(s) affected: Container Registry
  • Team attribution: Delivery
  • Minutes downtime or degradation: 24 minutes

Impact & Metrics

Question Answer
What was the impact? Service outage of the Container Registry
Who was impacted? All users and CI jobs that depend on the Container Registry
How did this impact customers? Docker images could not be pushed or pulled
How many attempts were made to access? 517,315
How many customers were affected?
How many customers tried to access?

Detection & Response

Question Answer
When was the incident detected? 2020-02-04 20:51 UTC
How was the incident detected? Prometheus detected a high rate of backend connection errors
Did alarming work as expected? Yes
How long did it take from the start of the incident to its detection? 2 minutes
How long did it take from detection to remediation? 24 minutes
What steps were taken to remediate? Applied a last known good configuration for the Container Registry
Were there any issues with the response? The root cause of the failure was difficult to identify

Timeline

2020-02-04

Root Cause Analysis

An engineer working to supplement an existing deployment method for our Kubernetes workloads was using CI to test the functionality of the work being performed. The issue associated with this work is #655 (closed); the merge request, then in WIP, is gitlab-com/gl-infra/k8s-workloads/gitlab-com!122 (merged).

The Work In Progress code was not feature complete and therefore introduced a change to the existing Container Registry configuration that prevented the service from properly handling requests. The disruptive change to the Container Registry configuration is shown below:

       auth:
         token:
-          realm: https://gitlab.com/jwt/auth
+          realm: https://gitlab.example.com/jwt/auth
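
A simple pre-apply sanity check could have caught the bad realm before it reached production. A minimal sketch in shell, assuming the rendered configuration is available as a file; the function name is illustrative, and the expected value is taken from the diff above:

```shell
# Sketch: refuse to apply a rendered registry config whose token realm
# differs from the known production value (taken from the diff above).
check_auth_realm() {
  # $1: path to the rendered registry configuration
  grep -q 'realm: https://gitlab\.com/jwt/auth' "$1"
}

# Usage (illustrative):
#   check_auth_realm rendered/registry.yml || { echo "unexpected auth realm"; exit 1; }
```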

The code being written did not contain proper safeguards, such as a dry-run mode to prevent the accidental application of a Work In Progress code change. CI jobs that are expected to run only for testing purposes currently have the ability to perform destructive actions. Mistakes in code could be prevented by limiting the access that CI jobs have in these use cases.
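
One way to implement such a safeguard is to default every CI apply to a dry-run unless the job runs from the protected default branch. A minimal sketch under that assumption; the `CI_COMMIT_BRANCH` variable in the usage note follows GitLab CI conventions, and the helm command is illustrative, not our actual deployment tooling:

```shell
# Sketch: emit the flags a deploy job should use, defaulting to dry-run.
deploy_flags() {
  branch="$1"    # branch the CI job is running for
  default="$2"   # protected default branch allowed to apply for real
  if [ "$branch" = "$default" ]; then
    echo "--atomic"   # real apply, rolled back automatically on failure
  else
    echo "--dry-run"  # branch pipelines only rehearse the change
  fi
}

# Usage (illustrative):
#   helm upgrade registry ./chart $(deploy_flags "$CI_COMMIT_BRANCH" "master")
```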

The deployment was noted as a failure, and an attempt to roll back the change was noted as successful. In reality the rollback did not take effect; why it was reported as successful requires further investigation.

What went well

  • Alerting worked as desired
  • Dashboards correctly showed that Pods were not taking sufficient load

What can be improved

Corrective actions

All corrective actions are owned by @skarbek

All corrective actions are due 2020-02-12

  • #664 (closed) - CI jobs associated with branches, where only a dry-run is expected, should not have the ability to accidentally perform destructive actions in production environments
  • #665 (closed) - CI Jobs on branches should not run against production environments at all
  • #666 (closed) - CI jobs should have protected variables for gprd such that anything beyond a dry-run is not possible
  • #667 (closed) - CI pipeline promotion should happen prior to canary
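
Corrective actions #664 and #665 could take the shape of pipeline rules that keep branch pipelines away from production entirely. A hypothetical .gitlab-ci.yml fragment; the job names and the apply script are illustrative, not our actual pipeline definition:

```yaml
# Hypothetical sketch of #664/#665: the real apply only runs on the
# protected default branch; branch pipelines get a dry-run job instead.
apply-gprd:
  script: ./bin/apply-config gprd
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'

dry-run-gprd:
  script: ./bin/apply-config gprd --dry-run
  rules:
    - if: '$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH'
```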

Deadlines

  • end of work day 2020-02-05: RCA completed
    • Rough timeline and some corrective suggestions prior to Company call time 2020-02-05
  • start of work day EMEA 2020-02-06: Remove confidentiality of this issue
  • end of work day 2020-02-06: Corrective items identified