Test gap analysis on incident `Container registry pulls failing with unknown blob`

Incident Review

Key Points

📄 Incident Review | 🚒 Incident

Fix MR: gitlab-org/container-registry!1298 (merged) 🔧

Corrective actions

End-to-end test review

We currently have two sets of end-to-end tests for the Container Registry. One targets self-managed installs. The other targets the .com Registry with the new metadata database and runs against environments such as staging-canary, staging, and pre. We have one test that uses the browser and another that uses the API (marked as :reliable, running on sanity pipelines).

These tests reported no failures during the incident, which means they did not catch the bug before it reached production.

Why

Behaviors observed during the incident:

  • pulls failed on retrieving layers or configuration blobs for existing images, after retrieving the manifest from the database
  • pushes succeeded
  • pushing then pulling succeeded (layer links were populated during the push) 👈
  • lists of tags were still available, but had the default published date (which is normally supplied by a configuration blob)

Our tests currently build an image -> push the image to the registry -> pull the same image. This flow populates the layer links during the push, so the subsequent pull succeeds. The bug was a check for filesystem layer link metadata before serving blob data, and its effects only became visible on an environment that held images without that metadata.

To catch the 🐛 here, we'd need an image that already exists in the registry and pull it before pushing it.
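The gap can be illustrated with a minimal sketch. This is a toy model, not the registry's actual code: `ToyRegistry`, `check_fs_links`, `pull_succeeds`, and the digests are all hypothetical names, and the two sets only stand in for the metadata database and the filesystem layer link metadata.

```python
from dataclasses import dataclass, field
from typing import Set


class BlobUnknownError(Exception):
    """Stands in for the registry's 'unknown blob' response."""


@dataclass
class ToyRegistry:
    # Hypothetical toy model, not the real container registry implementation.
    db_blobs: Set[str] = field(default_factory=set)   # blobs known to the metadata database
    fs_links: Set[str] = field(default_factory=set)   # filesystem layer link metadata
    check_fs_links: bool = True                       # True models the faulty behaviour

    def push(self, digest: str) -> None:
        self.db_blobs.add(digest)
        self.fs_links.add(digest)  # the push populates the layer link

    def pull(self, digest: str) -> str:
        if digest not in self.db_blobs:
            raise BlobUnknownError(digest)
        # Faulty check: require filesystem link metadata before serving the blob.
        if self.check_fs_links and digest not in self.fs_links:
            raise BlobUnknownError(digest)
        return f"blob:{digest}"


def pull_succeeds(registry: ToyRegistry, digest: str) -> bool:
    try:
        registry.pull(digest)
        return True
    except BlobUnknownError:
        return False


# An existing image is in the database but has no filesystem layer link.
registry = ToyRegistry(db_blobs={"sha256:existing"})
registry.push("sha256:fresh")

print(pull_succeeds(registry, "sha256:fresh"))     # push then pull -> True
print(pull_succeeds(registry, "sha256:existing"))  # pull only -> False
```

In this model the push-then-pull flow can never fail, because the push itself creates the state the faulty check looks for. Only the pull of a pre-existing image exposes the problem.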

Next steps

We currently create all the data that the Container Registry end-to-end tests require as we run them, in all environments. Keeping tests as atomic as possible keeps them independent of data that may or may not be prepopulated and improves their reliability.

With the above in mind, the test gap should be closed at a lower level of integration, where it is most relevant. fix(handlers): disable filesystem layer link me... (gitlab-org/container-registry!1298 - merged) works directly on:

  • TestManifestAPI_Get_Schema2NotInDatabase
  • TestManifestAPI_Put_Schema2WritesNoFilesystemBlobLinkMetadata
  • TestManifestAPI_Delete_Schema2ManifestNotInDatabase
  • TestManifestAPI_Put_OCIImageIndexByTagManifestsNotPresentInDatabase

TestManifestAPI_Put_Schema2WritesNoFilesystemBlobLinkMetadata was added to directly prevent a regression. The test checks that when a Docker image is pushed to a registry with the database enabled, the link metadata for the image's blob (layer) is not written to the filesystem. When the blob is then requested from a registry with the database disabled (which therefore relies solely on the filesystem), it should not be found.
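The shape of that check can be sketched as follows. This is only an illustration of the assertion pattern, not the test from the MR: `RegistryInstance` and its methods are hypothetical, and plain sets stand in for the metadata database and the filesystem link metadata shared by both instances.

```python
from typing import Set


class RegistryInstance:
    """Hypothetical toy model: one registry process over shared storage."""

    def __init__(self, database: Set[str], filesystem_links: Set[str],
                 database_enabled: bool) -> None:
        self.database = database
        self.filesystem_links = filesystem_links
        self.database_enabled = database_enabled

    def push_blob(self, digest: str) -> None:
        # With the database enabled, metadata goes to the database only.
        if self.database_enabled:
            self.database.add(digest)
        else:
            self.filesystem_links.add(digest)

    def has_blob(self, digest: str) -> bool:
        source = self.database if self.database_enabled else self.filesystem_links
        return digest in source


def test_put_writes_no_filesystem_blob_link_metadata() -> None:
    database: Set[str] = set()
    filesystem_links: Set[str] = set()
    with_db = RegistryInstance(database, filesystem_links, database_enabled=True)
    without_db = RegistryInstance(database, filesystem_links, database_enabled=False)

    with_db.push_blob("sha256:layer")

    assert with_db.has_blob("sha256:layer")         # served from the database
    assert filesystem_links == set()                # nothing written to the filesystem
    assert not without_db.has_blob("sha256:layer")  # filesystem-only lookup fails


test_put_writes_no_filesystem_blob_link_metadata()
```

Asserting through a second, database-disabled instance checks the filesystem state by observable behavior rather than by inspecting storage paths directly.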

Edited by Sofia Vistas