2021-02-05: Dependency proxy unable to pull images

Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other details, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally.
By default, all information we can share will be public, in accordance with our transparency value.

Summary

An MR (gitlab-org/gitlab!52805 (merged)) was deployed to production 40-60 minutes ago, breaking our Dependency Proxy feature. A code revert (gitlab-org/gitlab!53506 (merged)) and redeploy to production should fix the impact for customers.

Timeline

All times UTC.

2021-02-05

Corrective Actions


Incident Review


Summary

From 2021-02-05 14:48 UTC until 2021-02-08 22:12 UTC, GitLab.com experienced an outage of the Dependency Proxy feature. The underlying cause was a bug specific to GCP object storage that was not caught in development testing. The change was rolled back, allowing the Dependency Proxy to resume functioning while a fix was worked on separately.

  1. Service(s) affected: Dependency Proxy (web)
  2. Team attribution:
  3. Time to detection: 26 min.
  4. Minutes downtime or degradation: Approximately 3 days (2021-02-05 14:48 UTC to 2021-02-08 22:12 UTC) during which the feature was not functional on GitLab.com.

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Any GitLab.com customers using the Dependency Proxy; the largest consumer at the time of the outage was internal customers.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Pulling images through the Dependency Proxy failed, which prevented affected pipelines from running.
  3. How many customers were affected?
    1. ...
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. ...

What were the root causes?

"5 Whys"

  1. Why was this not discovered in development or review?

    • The bug was specific to GCP object storage. The developer and reviewers who tested the feature used the GDK configured with local file storage and with MinIO object storage, neither of which is affected by the bug.
  2. Why don't we test against various cloud providers in either E2E tests or when manually testing features like this?

    • This is a good question that will be reviewed and will result in some corrective actions.
  3. Why was this not noticed before pipelines failed?

    • gitlab-org/gitlab pipelines run constantly, so even with monitoring in place, those pipeline failures would likely still have been what triggered the alerts.
  4. Why was this not fixed on the day it occurred?

    • Because there was a workaround and the revert MR depended on an auto-deploy, and the next possible auto-deploy wasn't until Monday, the incident persisted through the weekend.
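The first "why" above comes down to differences between object-storage backends. As an illustration only (the values are placeholders and this is a sketch of GitLab's documented object-storage settings, not the configuration involved in the incident), a development setup backed by MinIO looks quite different from one backed by GCS:

```yaml
# Illustrative placeholder values -- not the actual incident configuration.
# MinIO (S3-compatible, common in local development with the GDK):
object_store:
  enabled: true
  connection:
    provider: AWS
    aws_access_key_id: minio
    aws_secret_access_key: gdk-minio
    endpoint: http://127.0.0.1:9000   # local MinIO server
    path_style: true                  # MinIO expects path-style URLs
---
# GCS (as used on GitLab.com):
object_store:
  enabled: true
  connection:
    provider: Google
    google_project: example-project
    google_json_key_location: /path/to/service-account.json
```

Code paths that work against local files or an S3-compatible API can still fail against GCS (signed-URL semantics, upload handling, and similar provider-specific behavior differ), which is why testing only against local and MinIO storage did not surface this bug.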

Incident Response Analysis

  1. How was the incident detected?
    1. gitlab-org/gitlab pipelines began failing.
  2. How could detection time be improved?
    1. If we had had monitoring of this feature with specific alerts, we might have caught it a little faster.
  3. How was the root cause diagnosed?
    1. Investigation from a group of backend engineers in the package stage with domain expertise.
  4. How could time to diagnosis be improved?
    1. We could set up monitoring of Dependency Proxy endpoints; at the time of the incident, there was no specific monitoring for the feature.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. We quickly realized the fix was not simple and that a revert would make the most sense.
  6. How could time to mitigation be improved?
    1. A revert could have been put on the table as soon as there was uncertainty about whether we could identify the root cause and prepare a patch.
  7. What went well?
    1. We were lucky enough to have a few experts all online at the same time to help diagnose the problem, suggest workarounds, and prepare the revert.
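Several answers above point to the same gap: no feature-specific monitoring for the Dependency Proxy. A minimal sketch of the severity logic such a blackbox probe might use (the function name and thresholds are hypothetical, not an existing GitLab alerting rule):

```python
def classify_status(status_code):
    """Map an HTTP status from a Dependency Proxy probe to a severity.

    Hypothetical mapping: 2xx is healthy, 5xx (or no response at all)
    should page, and anything else is a warning worth investigating.
    """
    if status_code is None:  # timeout or connection error
        return "alert"
    if 200 <= status_code < 300:
        return "ok"
    if status_code >= 500:
        return "alert"
    return "warn"


if __name__ == "__main__":
    # A probe loop would request a proxy endpoint and feed the status in:
    for code in (200, 502, 401, None):
        print(code, "->", classify_status(code))
```

In practice this would sit behind a scheduled probe (for example, a blackbox HTTP check) hitting a known Dependency Proxy URL, so a regression like this one would fire an alert instead of waiting for pipeline failures.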

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. Not specifically, although we have seen bugs that are isolated to specific object-storage providers (GCP, AWS S3, Azure, etc.).
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. No
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes: gitlab-org/gitlab#290944 (closed)

Lessons Learned

  1. The testing strategy of the package stage should be reviewed. Local testing often covers only a few storage configurations and does not include cloud object-storage providers. Package features often rely on object storage, so we should create realistic environments when testing locally: gitlab-com/www-gitlab-com!75193 (merged).

Guidelines

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)

Incident Review Stakeholders

Edited by Steve Abrams