Incident Review: Container registry pulls failing with unknown blob
Incident Review
The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External and internal customers were affected.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers saw "blob not found" errors when attempting to retrieve images from the GitLab Container Registry.
  - Users could push to the registry, but could not pull from it.
- How many customers were affected?
  - 4,663
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
- During a refactor, we removed migration mode, but we also removed logic which disables writing and reading of blob link FS metadata: gitlab-org/container-registry@v3.71.0-gitlab...v3.73.0-gitlab
  This caused serving blobs to fail, since the registry tried to Stat their FS metadata: https://gitlab.com/gitlab-org/container-registry/-/blob/v3.73.0-gitlab/registry/storage/blobserver.go#L34
- Blob unknown errors are common, and Docker returns this same error for multiple scenarios.
- After the MR was merged, the error volume in pre-prod environments was not impactful enough to be noticed during verification.
Incident Response Analysis
- How was the incident detected?
  - An internal engineer detected the failure because it was causing broken master on gitlab-org/gitlab
- How could detection time be improved?
  - Add monitoring alerts for the errors seen during the incident.
- How was the root cause diagnosed?
  - Data about the bug's behavior were gathered during the course of the incident.
  - We used these data to narrow down the source of the error within the registry code.
  - A local reproducer was put together based on the above analysis.
- How could time to diagnosis be improved?
  - By turning on debug-level logging on pre and staging, which were also affected. This would have let us see a filesystem-based access check for the blob, which would have pinpointed the source of the issue.
- How did we reach the point where we knew how to mitigate the impact?
  - We observed that the onset of the incident coincided with the deployment of a new registry version.
- How could time to mitigation be improved?
  - Provide a fallback for the deployment pipeline's dependency on the Container Registry service on gitlab.com.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, code change: gitlab-org/container-registry#936
What went well?
- The response team quickly identified the recently deployed version of the Container Registry as the cause and began a rollback to the previous version.
- The response team engaged the Container Registry team to assist with determining the cause. Several team members from Container Registry were available to join the incident.