Incident Review: Container registry pulls failing with unknown blob
Incident Review
The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External and internal customers were affected.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers saw "blob not found" errors when attempting to retrieve images from the GitLab Container Registry.
  - Users could push to the registry, but could not pull from it.
- How many customers were affected?
  - 4,663
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
- During a refactor, we removed migration mode, but we also removed logic which disables writing and reading of blob link FS metadata: gitlab-org/container-registry@v3.71.0-gitlab...v3.73.0-gitlab
  This caused serving blobs to fail, since the registry tried to Stat their FS metadata: https://gitlab.com/gitlab-org/container-registry/-/blob/v3.73.0-gitlab/registry/storage/blobserver.go#L34
- Blob unknown errors are common, and Docker returns this same error for multiple scenarios.
- After the MR was merged, the error volume in pre-prod environments was not impactful enough to be noticed during verification.
Incident Response Analysis
- How was the incident detected?
  - An internal engineer detected the failure because it was causing broken master on gitlab-org/gitlab
- How could detection time be improved?
  - Add monitoring alerts for the errors seen during the incident.
- How was the root cause diagnosed?
  - Data about the bug's behavior were gathered during the course of the incident.
  - We used these data to narrow down the source of the error within the registry code.
  - A local reproducer was put together based on the above analysis.
- How could time to diagnosis be improved?
  - By turning on debug-level logging on pre and staging, which were also affected. This would have let us see a filesystem-based access check for the blob, which would have pinpointed the source of the issue.
- How did we reach the point where we knew how to mitigate the impact?
  - We observed that the onset of the incident coincided with the deployment of a new registry version.
- How could time to mitigation be improved?
  - Provide a fallback for the deployment pipeline's dependency on the Container Registry service on gitlab.com.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, code change: gitlab-org/container-registry#936
What went well?
- The response team quickly identified the recently deployed version of the Container Registry as the cause and began a rollback to the previous version.
- The response team engaged the Container Registry team to assist with determining the cause. Several team members from Container Registry were available to join the incident.