RCA for registry down 2020-02-04
Summary
- Service(s) affected: Container Registry
- Team attribution: Delivery
- Minutes downtime or degradation: 24 minutes
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact | Service outage of the Container Registry |
| Who was impacted | All users and CI jobs that depend on the Container Registry |
| How did this impact customers | Docker images could not be pushed or pulled |
| How many attempts made to access | 517,315 |
| How many customers affected | |
| How many customers tried to access | |
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | 2020-02-04 20:51 UTC |
| How was the incident detected? | Prometheus detected a high rate of backend connection errors |
| Did alarming work as expected? | Yes |
| How long did it take from the start of the incident to its detection? | 2 minutes |
| How long did it take from detection to remediation? | 24 minutes |
| What steps were taken to remediate? | Applied a last known good configuration for the Container Registry |
| Were there any issues with the response? | The root cause was difficult to identify |
Timeline
2020-02-04
- 20:47 - A dry-run pipeline began to test a registry change
- 20:51 - PagerDuty alert signaled increased error rates to the Registry Backend
- 20:53 - PagerDuty alert signaled the Container Registry serving high rate of 5xx errors
- 20:53 - PagerDuty alert signaled the Container Registry service is down
- 20:59 - Incident issue created
- 21:01 - Engineer familiar with the registry is engaged
- 21:10 - A destructive CI pipeline is identified
- 21:11 - Status page updated
- 21:12 - A previously known good pipeline is run to apply the correct config
- 21:13 - CI Job completes
- 21:14 - Registry manually scaled up to get enough pods quicker than the autoscaler would have
- 21:18 - Registry is verified working
Root Cause Analysis
An Engineer working to supplement an existing Deployment method for our Kubernetes workloads was using CI to test the functionality of the work being performed. The issue associated with this work: #655 (closed). The Merge Request, currently in WIP: gitlab-com/gl-infra/k8s-workloads/gitlab-com!122 (merged).
The Work In Progress code was not feature complete and therefore introduced a change to the existing configuration of the Container Registry that prevented the service from properly handling requests. The disruptive change to the Container Registry configuration is shown below:
```diff
 auth:
   token:
-    realm: https://gitlab.com/jwt/auth
+    realm: https://gitlab.example.com/jwt/auth
```
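The change swapped the production JWT token realm for a placeholder host, so every registry auth request failed. A minimal pre-apply guard could have rejected the config before it reached production; the sketch below is illustrative only (the allowlist, function name, and config shape are assumptions, not part of the actual pipeline):

```python
from urllib.parse import urlparse

# Hosts the registry's token realm is allowed to point at (assumed allowlist).
ALLOWED_REALM_HOSTS = {"gitlab.com"}

def realm_is_safe(config: dict) -> bool:
    """Return True only if auth.token.realm points at an allowed host."""
    realm = config.get("auth", {}).get("token", {}).get("realm", "")
    return urlparse(realm).hostname in ALLOWED_REALM_HOSTS

good = {"auth": {"token": {"realm": "https://gitlab.com/jwt/auth"}}}
bad = {"auth": {"token": {"realm": "https://gitlab.example.com/jwt/auth"}}}

assert realm_is_safe(good)
assert not realm_is_safe(bad)  # the WIP change would have been rejected
```

A check like this, run as a CI validation step, turns a silent misconfiguration into a failed pipeline.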
The code being written did not contain proper safeguards, such as an enforced dry run, to prevent accidental application of a Work In Progress code change. CI jobs that are expected to run only for testing purposes currently have the ability to perform destructive actions. Mistakes like this could be prevented by limiting the access that CI jobs have in these use cases.
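One way to enforce that limitation is a gate that forces dry-run mode on every branch except the default one. The sketch below uses the real GitLab CI variables `CI_COMMIT_BRANCH` and `CI_DEFAULT_BRANCH`, but the function names and the command strings are hypothetical, not our actual tooling:

```python
def must_dry_run(env: dict) -> bool:
    """Destructive actions are allowed only on the default branch;
    every other pipeline is forced into dry-run mode."""
    branch = env.get("CI_COMMIT_BRANCH", "")
    default = env.get("CI_DEFAULT_BRANCH", "master")
    return branch != default

def apply_config(env: dict) -> str:
    """Return the command a CI job would run (sketch only)."""
    if must_dry_run(env):
        return "kubectl apply --dry-run=client -f registry.yaml"  # no cluster changes
    return "kubectl apply -f registry.yaml"  # real apply, default branch only

assert "dry-run" in apply_config({"CI_COMMIT_BRANCH": "wip-branch"})
assert "dry-run" not in apply_config(
    {"CI_COMMIT_BRANCH": "master", "CI_DEFAULT_BRANCH": "master"})
```

Pairing a gate like this with protected CI variables would keep branch pipelines from ever holding the credentials needed for a real apply.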
The Deployment was reported as a failure, and a subsequent attempt to roll back the change was reported as successful. In fact the rollback did not restore service; this requires further investigation.
What went well
- Alerting worked as desired
- Dashboards correctly pointed out that Pods were not taking load sufficiently
What can be improved
- #662 (closed) - Investigate why the attempted rollback was reported as successful while we remained in a state of degradation.
- #663 (closed) - Pods were Ready and passing liveness checks, but were not taking any traffic; this requires investigation.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9121 - Finding a way to page the IMOC and CMOC to update the status page took longer than expected, and they had to be paged manually. The commands to page are not listed in the incident management documentation.
Corrective actions
All corrective actions are owned by @skarbek
All corrective actions are due 2020-02-12
- #664 (closed) - CI jobs associated with branches, where a dry run is expected, should not have the ability to accidentally perform destructive actions in Production environments.
- #665 (closed) - CI jobs on branches should not run against production environments at all.
- #666 (closed) - CI jobs should have protected variables for gprd such that dry-runs are not possible.
- #667 (closed) - CI pipeline promotion should happen prior to canary.
Deadlines
- end of work day 2020-02-05: RCA completed
- Rough timeline and some corrective suggestions prior to Company call time 2020-02-05
- start of work day EMEA 2020-02-06: Remove confidentiality of this issue
- end of work day 2020-02-06: Corrective items identified

