Test gap analysis on incident `Container registry pulls failing with unknown blob`
Incident Review
Key Points
- Customers saw blob not found errors when attempting to retrieve images from the GitLab Container Registry
- Users could push to the registry, but not retrieve
- External and internal customers were affected
- RC: During a refactor that removed migration mode, we also removed the logic that disables writing and reading of blob link FS metadata. As a result, serving blobs failed because the registry tried to Stat their FS metadata: https://gitlab.com/gitlab-org/container-registry/-/blob/v3.73.0-gitlab/registry/storage/blobserver.go#L34 (deep dive), additional context
Fix MR: gitlab-org/container-registry!1298 (merged)
Corrective actions
- Corrective Action: Improve debug logging around... (gitlab-org/container-registry#1015 - closed) | not yet scheduled
- The fix MR improves upon registry/handlers/api_integration_test.go
- Helping accelerate release automation with gitlab-org/container-registry!1295 (merged) (quoting from the RCA deep dive: "Right now, releasing a new version of the registry to production is a mostly manual process. This means that we tend to wait until we have a reason to cut a version, either we have something we want to see on production, or we have to make a self-managed release. Barring truly automating this process, we should look into cutting a release at a regular cadence if there are any changes, such as every Monday.") | %16.0
End-to-end test review
We currently have two sets of tests for the Container Registry. One targets self-managed installs and the other targets the .com Registry with the new metadata database. The .com set runs against environments such as staging-canary, staging, and pre. We have one test that uses the browser and another that uses the API (marked as :reliable, running on sanity pipelines).
There were no reports of failures in these tests during the incident, which means they did not catch the bug before it reached production.
Behaviors observed during the incident:
- pulls failed on retrieving layers or configuration blobs for existing images, after retrieving the manifest from the database
- pushes succeeded
- pushing then pulling succeeded (layer links were populated during the push) 👈
- lists of tags were still available, but had the default published date (which is normally supplied by a configuration blob)
Our tests currently build an image -> push the image to the registry -> pull the same image. This flow populates the layer links, so the pull always succeeds. Only on an environment holding images without filesystem layer link metadata could we see the effect of checking for that filesystem data before serving blob data, and so catch the bug.
To successfully catch the
Next steps
We currently create all our required data for the end-to-end tests for the Container Registry as we run them in all environments. Keeping tests as atomic as possible keeps them independent of data that may or may not be prepopulated and improves their reliability.
With the above in mind, the test gap should be closed at a lower level of integration, where it is most relevant. fix(handlers): disable filesystem layer link me... (gitlab-org/container-registry!1298 - merged) works directly on:
- TestManifestAPI_Get_Schema2NotInDatabase
- TestManifestAPI_Put_Schema2WritesNoFilesystemBlobLinkMetadata
- TestManifestAPI_Delete_Schema2ManifestNotInDatabase
- TestManifestAPI_Put_OCIImageIndexByTagManifestsNotPresentInDatabase
TestManifestAPI_Put_Schema2WritesNoFilesystemBlobLinkMetadata was added, which directly prevents regression. The test checks that when a Docker image is pushed to a registry with the database enabled, the link metadata for the image's blob (layer) is not written to the filesystem. When the blob is then requested from a registry with the database disabled (which relies solely on the filesystem), it should not be found.