RCA for registry down 2020-02-04
Summary
- Service(s) affected: Container Registry
- Team attribution: Delivery
- Minutes downtime or degradation: 24 minutes
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact | Service outage of the Container Registry |
| Who was impacted | All users and CI jobs that depend on the Container Registry |
| How did this impact customers | Docker images could not be pushed or pulled |
| How many attempts made to access | 517,315 |
| How many customers affected | |
| How many customers tried to access | |
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | 2020-02-04 20:51 UTC |
| How was the incident detected? | Prometheus detected a high rate of backend connection errors |
| Did alarming work as expected? | Yes |
| How long did it take from the start of the incident to its detection? | 2 minutes |
| How long did it take from detection to remediation? | 24 minutes |
| What steps were taken to remediate? | Applied a last known good configuration for the Container Registry |
| Were there any issues with the response? | The root cause was difficult to identify |
Timeline
2020-02-04
- 20:47 - A dry-run pipeline began to test a registry change
- 20:51 - PagerDuty alert signaled increased error rates to the Registry Backend
- 20:53 - PagerDuty alert signaled the Container Registry serving high rate of 5xx errors
- 20:53 - PagerDuty alert signaled the Container Registry service is down
- 20:59 - Incident issue created
- 21:01 - Engineer familiar with the registry is engaged
- 21:10 - A destructive CI pipeline is identified
- 21:11 - Status page updated
- 21:12 - A previously known good pipeline is run to apply the correct config
- 21:13 - CI Job completes
- 21:14 - Registry manually scaled up to get enough pods quicker than the autoscaler would have
- 21:18 - Registry is verified working
Root Cause Analysis
An Engineer working to supplement an existing Deployment method for our Kubernetes workloads was using CI to test the functionality of the work being performed. The issue associated with this work: #655 (closed). The Merge Request, currently in WIP: gitlab-com/gl-infra/k8s-workloads/gitlab-com!122 (merged).
The Work In Progress code was not feature complete and therefore introduced a change to the existing configuration of the Container Registry that prevented the service from properly handling requests. The disruptive change to the Container Registry configuration is shown below:
```diff
 auth:
   token:
-    realm: https://gitlab.com/jwt/auth
+    realm: https://gitlab.example.com/jwt/auth
```
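The change swapped the production JWT token realm for a placeholder host, so every registry auth request failed. A minimal pre-apply guard could have rejected the config before it reached production; the sketch below is illustrative only (the allowlist, function name, and config shape are assumptions, not part of the actual pipeline):

```python
from urllib.parse import urlparse

# Hosts the registry's token realm is allowed to point at (assumed allowlist).
ALLOWED_REALM_HOSTS = {"gitlab.com"}

def realm_is_safe(config: dict) -> bool:
    """Return True only if auth.token.realm points at an allowed host."""
    realm = config.get("auth", {}).get("token", {}).get("realm", "")
    return urlparse(realm).hostname in ALLOWED_REALM_HOSTS

good = {"auth": {"token": {"realm": "https://gitlab.com/jwt/auth"}}}
bad = {"auth": {"token": {"realm": "https://gitlab.example.com/jwt/auth"}}}

assert realm_is_safe(good)
assert not realm_is_safe(bad)  # the WIP change would have been rejected
```

A check like this, run as a CI validation step, turns a silent misconfiguration into a failed pipeline.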
The code being written did not contain proper safeguards, such as an enforced dry run, to prevent accidental application of a Work In Progress code change. CI jobs that are expected to run only for testing purposes currently have the ability to perform destructive actions. Mistakes like this could be prevented by limiting the access that CI jobs have in these use cases.
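One way to enforce that limitation is a gate that forces dry-run mode on every branch except the default one. The sketch below uses the real GitLab CI variables `CI_COMMIT_BRANCH` and `CI_DEFAULT_BRANCH`, but the function names and the command strings are hypothetical, not our actual tooling:

```python
def must_dry_run(env: dict) -> bool:
    """Destructive actions are allowed only on the default branch;
    every other pipeline is forced into dry-run mode."""
    branch = env.get("CI_COMMIT_BRANCH", "")
    default = env.get("CI_DEFAULT_BRANCH", "master")
    return branch != default

def apply_config(env: dict) -> str:
    """Return the command a CI job would run (sketch only)."""
    if must_dry_run(env):
        return "kubectl apply --dry-run=client -f registry.yaml"  # no cluster changes
    return "kubectl apply -f registry.yaml"  # real apply, default branch only

assert "dry-run" in apply_config({"CI_COMMIT_BRANCH": "wip-branch"})
assert "dry-run" not in apply_config(
    {"CI_COMMIT_BRANCH": "master", "CI_DEFAULT_BRANCH": "master"})
```

Pairing a gate like this with protected CI variables would keep branch pipelines from ever holding the credentials needed for a real apply.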
The Deployment was reported as a failure, and a subsequent attempt to roll back the change was reported as successful. In fact the rollback did not restore service; this requires further investigation.
What went well
- Alerting worked as desired
- Dashboards correctly pointed out that Pods were not taking load sufficiently
What can be improved
- #662 (closed) - Investigate why the attempted rollback was reported as successful while we remained in a state of degradation.
- #663 (closed) - Pods were Ready and passing liveness checks, but were not taking any traffic; this requires investigation.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9121 - Finding a way to page the IMOC and CMOC to update the status page took longer than expected, and they had to be paged manually. The commands to page are not listed in the incident management documentation.
Corrective actions
All corrective actions are owned by @skarbek
All corrective actions are due 2020-02-12
- #664 (closed) - CI jobs associated with branches, where a dry run is expected, should not have the ability to accidentally perform destructive actions in Production environments.
- #665 (closed) - CI jobs on branches should not run against production environments at all.
- #666 (closed) - CI jobs should have protected variables for gprd such that dry-runs are not possible.
- #667 (closed) - CI pipeline promotion should happen prior to canary.
Deadlines
- end of work day 2020-02-05: RCA completed
- Rough timeline and some corrective suggestions prior to Company call time 2020-02-05
- start of work day EMEA 2020-02-06: Remove confidentiality of this issue
- end of work day 2020-02-06: Corrective items identified

