Test gap analysis on incident `Container registry pulls failing with unknown blob`

Incident Review

Key Points

Customers saw blob not found errors when attempting to retrieve images from the GitLab Container Registry
Users could push to the registry, but not retrieve
External and internal customers were affected
Root cause (RC): During a refactor that removed migration mode, we also removed the logic that disables writing and reading filesystem (FS) metadata for blob links. As a result, serving blobs failed because the registry tried to Stat their FS metadata: https://gitlab.com/gitlab-org/container-registry/-/blob/v3.73.0-gitlab/registry/storage/blobserver.go#L34 (deep dive), additional context

Fix MR: gitlab-org/container-registry!1298 (merged) 🔧

Corrective actions

Corrective Action: Improve debug logging around... (gitlab-org/container-registry#1015 - closed) | not yet scheduled
The fix MR improves upon registry/handlers/api_integration_test.go
Helping accelerate release automation with gitlab-org/container-registry!1295 (merged) (quoting from the RCA deep dive: Right now, releasing a new version of the registry to production is a mostly manual process. This means that we tend to wait until we have a reason to cut a version, either we have something we want to see on production, or we have to make a self-managed release. Barring truly automating this process, we should look into cutting a release at a regular cadence if there are any changes, such as every Monday.) | %16.0

End-to-end test review

We currently have two sets of tests for the Container Registry. One targets self-managed installs and the other targets the .com Registry with the new metadata database. The .com set runs against environments such as staging-canary, staging, and pre. We have one test that uses the browser and another that uses the API (marked as :reliable, running on sanity pipelines).

There were no reports of failures in these tests during the incident, which means they did not catch the bug before it reached production.

❓ Why

Behaviors observed during the incident:

pulls failed on retrieving layers or configuration blobs for existing images, after retrieving the manifest from the database
pushes succeeded
pushing then pulling succeeded (layer links were populated during the push) 👈
lists of tags were still available, but had the default published date (which is normally supplied by a configuration blob)

Our tests currently build an image -> push the image to the registry -> pull the same image. This flow populates the layer links. Only on an environment that had images without filesystem layer link metadata could we see the effect of checking for that filesystem data before serving blob data, and so catch the bug.

To successfully catch the 🐛 here, we'd need an image that already exists in the registry and pull it without pushing it first.
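As an illustration of why the push-then-pull flow masked the bug, here is a minimal sketch using a toy in-memory model. It is not the registry's implementation: `toyRegistry`, `Push`, `Pull`, `seedExisting`, and the `useFSLinks` flag are all hypothetical names, and the model assumes only what is described above (the affected version wrote filesystem link metadata on push and checked for it before serving a blob).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBlobUnknown stands in for the "unknown blob" error seen on pulls.
var ErrBlobUnknown = errors.New("blob unknown")

// toyRegistry is a hypothetical in-memory model, not the real registry.
// dbLinks stands in for the metadata database, fsLinks for the
// filesystem layer link metadata.
type toyRegistry struct {
	blobs   map[string][]byte
	dbLinks map[string]bool
	fsLinks map[string]bool
	// useFSLinks models the regression: when true, pushes write
	// filesystem links and pulls require them.
	useFSLinks bool
}

func newToyRegistry(useFSLinks bool) *toyRegistry {
	return &toyRegistry{
		blobs:      map[string][]byte{},
		dbLinks:    map[string]bool{},
		fsLinks:    map[string]bool{},
		useFSLinks: useFSLinks,
	}
}

// seedExisting simulates an image that was pushed before the regression:
// it is known to the database but has no filesystem link metadata.
func (r *toyRegistry) seedExisting(digest string, content []byte) {
	r.blobs[digest] = content
	r.dbLinks[digest] = true
}

// Push stores the blob and records it in the database. In the
// regressed configuration it also writes filesystem link metadata.
func (r *toyRegistry) Push(digest string, content []byte) {
	r.blobs[digest] = content
	r.dbLinks[digest] = true
	if r.useFSLinks {
		r.fsLinks[digest] = true
	}
}

// Pull serves a blob. In the regressed configuration it first checks
// for filesystem link metadata, which pre-existing images lack.
func (r *toyRegistry) Pull(digest string) ([]byte, error) {
	if !r.dbLinks[digest] {
		return nil, ErrBlobUnknown
	}
	if r.useFSLinks && !r.fsLinks[digest] {
		return nil, ErrBlobUnknown
	}
	return r.blobs[digest], nil
}

func main() {
	regressed := newToyRegistry(true)
	regressed.seedExisting("sha256:old", []byte("layer"))

	// Flow used by the current end-to-end tests: push, then pull.
	regressed.Push("sha256:new", []byte("layer"))
	_, errNew := regressed.Pull("sha256:new")

	// Flow that would have caught the bug: pull without pushing first.
	_, errOld := regressed.Pull("sha256:old")

	fmt.Println("push then pull error:", errNew)
	fmt.Println("pull pre-existing error:", errOld)
}
```

In this model the push-then-pull flow succeeds even in the regressed configuration, while pulling the seeded image fails, which matches the behaviors observed during the incident.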

Next steps

We currently create all our required data for the end-to-end tests for the Container Registry as we run them in all environments. Keeping tests as atomic as possible keeps them independent of data that may or may not be prepopulated and improves their reliability.

With the above in mind, the test gap should be closed at a lower level of integration, where it is most relevant. fix(handlers): disable filesystem layer link me... (gitlab-org/container-registry!1298 - merged) works directly on:

TestManifestAPI_Get_Schema2NotInDatabase
TestManifestAPI_Put_Schema2WritesNoFilesystemBlobLinkMetadata
TestManifestAPI_Delete_Schema2ManifestNotInDatabase
TestManifestAPI_Put_OCIImageIndexByTagManifestsNotPresentInDatabase

TestManifestAPI_Put_Schema2WritesNoFilesystemBlobLinkMetadata was added to directly prevent regression. It checks that when a Docker image is pushed to a registry with the database enabled, no filesystem link metadata is written for the image's blob (layer). When the blob is then requested from a registry with the database disabled (and hence relying solely on the filesystem), it should not be found.
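The shape of that check can be sketched roughly as follows. This is a simplified model and not the code of the actual test: `memStore`, `pushBlob`, `statBlobFSOnly`, `countLinks`, and the path layout are hypothetical and chosen only to illustrate the assertion (push with the database enabled, then look the blob up the way a filesystem-only registry would).

```go
package main

import (
	"fmt"
	"strings"
)

// memStore is a hypothetical in-memory stand-in for the storage backend,
// mapping paths to file contents.
type memStore map[string][]byte

// blobPath and linkPath use an illustrative layout, not necessarily the
// registry's real one.
func blobPath(digest string) string {
	return "blobs/" + digest + "/data"
}

func linkPath(repo, digest string) string {
	return "repositories/" + repo + "/_layers/" + digest + "/link"
}

// pushBlob always stores the blob content. The repository-scoped link
// file is written only when the database is disabled.
func pushBlob(store memStore, repo, digest string, content []byte, dbEnabled bool) {
	store[blobPath(digest)] = content
	if !dbEnabled {
		store[linkPath(repo, digest)] = []byte(digest)
	}
}

// statBlobFSOnly models a registry with the database disabled: a blob is
// visible in a repository only if its link file exists.
func statBlobFSOnly(store memStore, repo, digest string) bool {
	_, ok := store[linkPath(repo, digest)]
	return ok
}

// countLinks returns how many link files exist in the store.
func countLinks(store memStore) int {
	n := 0
	for path := range store {
		if strings.HasSuffix(path, "/link") {
			n++
		}
	}
	return n
}

func main() {
	store := memStore{}
	pushBlob(store, "group/project", "sha256:abc", []byte("layer"), true)

	fmt.Println("link files written:", countLinks(store))
	fmt.Println("visible to filesystem-only registry:",
		statBlobFSOnly(store, "group/project", "sha256:abc"))
}
```

Asserting on the absence of link metadata, rather than on a successful pull, keeps the check independent of whatever data an environment happens to contain.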

Edited May 15, 2023 by Sofia Vistas