Docker Buildkit uploading image manifest lists/indexes with invalid references
Problem
In #404 (closed), we learned that it's currently possible to push manifest lists/indexes that reference not manifests but layers. This causes problems with, for example, offline garbage collection.
According to the Docker Image Spec and OCI Image Spec, a manifest list/index should reference manifests and manifests only. Nevertheless, the lack of validation has allowed an unknown number of invalid images to be uploaded to the GitLab Container Registry, both for GitLab.com and self-managed.
In this specific case, the root cause lies with Docker's Buildkit (directly or through buildx) implementation of remote cache images. It has been reported upstream by multiple sources ([1], [2], [3], [4], [5]), and now by us as well ([6]). It's still unclear whether this will be addressed.
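To make the problem concrete, below is a minimal, hypothetical sketch (placeholder digests and sizes, not a real payload) of the kind of index that BuildKit pushes for registry cache images. Note that none of the referenced entries carry a manifest media type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative only: a simplified BuildKit registry cache "image" with
// placeholder digests and sizes. The entries under "manifests" are layers
// plus a proprietary cache config, not image manifests as the specs require.
const buildkitCacheIndex = `{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:aaaa", "size": 1024},
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:bbbb", "size": 2048},
    {"mediaType": "application/vnd.buildkit.cacheconfig.v0", "digest": "sha256:cccc", "size": 512}
  ]
}`

func main() {
	var idx struct {
		Manifests []struct {
			MediaType string `json:"mediaType"`
			Digest    string `json:"digest"`
		} `json:"manifests"`
	}
	if err := json.Unmarshal([]byte(buildkitCacheIndex), &idx); err != nil {
		panic(err)
	}
	for _, ref := range idx.Manifests {
		// None of these media types identify an image manifest.
		fmt.Printf("%s -> %s\n", ref.Digest, ref.MediaType)
	}
}
```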
How is it handled by other registries?
I tried to reproduce this against DockerHub, ACR (Azure), GHCR (GitHub), ECR (Amazon), Artifactory, Quay, and GCR (Google). Out of these, only DockerHub, ACR, and GHCR allow these cache images to be pushed. The remaining ones refused them for the same reason described here. Although I didn't test it myself, I also found a report that this does not work for Nexus (Sonatype) either.
Why are we affected?
There is an old (2016) workaround for a past bug in the Distribution registry (which the GitLab registry is based on) that converged layer and manifest link validations. For this reason, pushing a list/index that references a layer works just as if it referenced a manifest. This is not by design but rather a side effect of that workaround. There is a pull request that aims to revert it.
Regardless of this bug, we could have added validation to ensure that only manifests were referenced (by looking at their media type), but we (wrongly) assumed that this would not be technically possible, so it was never attempted.
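For illustration, here is a minimal sketch of what such a media-type check could have looked like. The media type constants are the standard Docker/OCI ones; the function and types are hypothetical, not existing registry code:

```go
package main

import "fmt"

// Descriptor holds the reference fields relevant to this check.
type Descriptor struct {
	MediaType string
	Digest    string
}

// Media types that a manifest list/index may legitimately reference:
// image manifests and (nested) lists/indexes.
var allowedReferenceMediaTypes = map[string]bool{
	"application/vnd.docker.distribution.manifest.v2+json":      true,
	"application/vnd.docker.distribution.manifest.list.v2+json": true,
	"application/vnd.oci.image.manifest.v1+json":                true,
	"application/vnd.oci.image.index.v1+json":                   true,
}

// validateListReferences returns an error as soon as a reference is not a
// manifest (for example a layer, as in the BuildKit cache images above).
func validateListReferences(refs []Descriptor) error {
	for _, ref := range refs {
		if !allowedReferenceMediaTypes[ref.MediaType] {
			return fmt.Errorf("reference %s has media type %q, which is not a manifest", ref.Digest, ref.MediaType)
		}
	}
	return nil
}

func main() {
	refs := []Descriptor{{MediaType: "application/vnd.oci.image.layer.v1.tar+gzip", Digest: "sha256:aaaa"}}
	fmt.Println(validateListReferences(refs))
}
```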
When did this problem start?
The upstream Distribution registry workaround that allowed this bug to sneak into the GitLab Container Registry was created in 2016. The first evidence that it created an unwanted side effect only surfaced now, with this buildkit issue (we can't rule out others, though). The first buildkit release was in 2018, but it only became stable in late 2020.
Until late 2020, buildkit used a Docker Manifest List for cache images and then moved on to an OCI Image Index. The previous implementation was not compliant with the Docker Manifest List spec either (lists were already referencing layers, not manifests). The only thing that changed with the move to OCI was the media types.
Is it the standard behavior for buildkit to build and push cache images to the registry?
No, it's an optional feature.
It's important to note that the problem is not the target final images but rather the cache ones. Docker has to construct a context (pull the base image, resolve dependencies, etc.). When the image is built, it's pushed to the registry. Optionally, the build context can also be packaged as an image and pushed to the registry, allowing it to be reused in future builds, reducing the build time. The problematic image is only the cache one.
How many of these images exist?
We "can't" currently answer this question, at least not for GitLab.com (due to the size of the registry). We should be able to do so once all metadata lives in the database.
What is the current impact?
Besides allowing invalid content (as per the Docker/OCI image specs) to get into the registry:
Until %14.0, we were not validating references of manifest lists during offline garbage collection (just like the upstream Distribution registry). As a result, this would cause referenced manifests and their layers to be garbage collected when they should not. We fixed this by validating manifest list references during offline GC, and this is how we learned about this problem in #404 (closed).
This currently prevents self-managed users who have buildkit cache images on their registries from running offline garbage collection. For GitLab.com, the only impact is on the UI, where these images are currently shown as broken. The same applies to self-managed instances.
What is the impact on the registry migration?
The current metadata import tool follows the Docker/OCI specs, so it won't process and migrate these images from the old to the new registry. Brand new cache images will be rejected when pushed to the new registry. The database schema was designed based on the Docker/OCI specs, so it cannot accommodate non-conformant images as-is.
If we start refusing the push of these cache images, what is the impact?
Docker builds using this feature would fail. Although the real image can be successfully built and pushed to the registry, both images are built and pushed in a single run, so the CI job would fail.
If it is not possible to pull these cache images, what is the impact?
Docker builds using this feature are slower but succeed. The build won't fail, but the build context has to be rebuilt from scratch.
Solution
Unblocking Offline GC
In regards to #404 (closed), there is an MR (!640 (merged)) to skip invalid references during offline GC. This is not ideal, but it's the best we can do. We'll ship this ASAP.
Damage Control
As a first step, we should disallow uploads of image manifest lists/indexes with invalid references.
This does not apply to buildkit cache images. They already managed to sneak in, and are produced by an official tool from a widely used provider, so until we have a definitive solution for them, we should let them continue to flow in and out. We can easily identify them through the proprietary application/vnd.buildkit.cacheconfig.v0 media type. When detected, we should log a warning (so that we can count how frequently they are pushed to the registry) and allow them to proceed.
There may be other invalid use cases from other tools, but we have to stop this from getting worse and identify any others (if any) ASAP. Therefore, if another, unknown tool pushes manifest lists/indexes with invalid references, the push should be refused with an informative 400 Bad Request response and a log message produced (so that we can become aware of it). I'm proposing that we refuse such uploads instead of just logging them to avoid creating yet another exception (because we don't know whether they would be the first instance of a new case) that we would have to deal with later on.
This behavior should be behind a new configuration parameter, which defaults to true. A new issue should be opened for any new occurrence.
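Putting this section together, here is a rough sketch of the proposed push-time behavior. All names, including the configuration flag, are hypothetical; this is not actual registry code:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Proprietary media type that identifies BuildKit cache images.
const mediaTypeBuildKitCacheConfig = "application/vnd.buildkit.cacheconfig.v0"

var errInvalidListReferences = errors.New("manifest list/index references objects that are not manifests")

type descriptor struct {
	MediaType string
	Digest    string
}

// isManifestMediaType reports whether a media type identifies a manifest or
// a (nested) list/index, i.e. something a list is allowed to reference.
func isManifestMediaType(mt string) bool {
	switch mt {
	case "application/vnd.docker.distribution.manifest.v2+json",
		"application/vnd.docker.distribution.manifest.list.v2+json",
		"application/vnd.oci.image.manifest.v1+json",
		"application/vnd.oci.image.index.v1+json":
		return true
	}
	return false
}

// validateListPush sketches the proposed push-time behavior: BuildKit cache
// images are logged and accepted, any other list/index with invalid
// references is refused (the API layer would turn the error into a
// 400 Bad Request). referenceValidation stands in for the proposed
// configuration parameter, which would default to true.
func validateListPush(refs []descriptor, referenceValidation bool) error {
	if !referenceValidation {
		return nil
	}
	invalid, isBuildKitCache := false, false
	for _, ref := range refs {
		if ref.MediaType == mediaTypeBuildKitCacheConfig {
			isBuildKitCache = true
		}
		if !isManifestMediaType(ref.MediaType) {
			invalid = true
		}
	}
	switch {
	case !invalid:
		return nil
	case isBuildKitCache:
		// Known exception: count it and let it through.
		log.Println("warn: accepting non-conformant BuildKit cache image")
		return nil
	default:
		// Unknown tool: refuse and log so we become aware of new cases.
		log.Println("error: refusing manifest list/index with invalid references")
		return errInvalidListReferences
	}
}

func main() {
	refs := []descriptor{{MediaType: "application/vnd.oci.image.layer.v1.tar+gzip", Digest: "sha256:aaaa"}}
	fmt.Println(validateListPush(refs, true))
}
```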
Root Cause
We'd like this to be addressed centrally in Buildkit. We've been working on this here (starting in this comment) and have proposed what might be a universal solution here.
If this is fixed in Buildkit, we could rely on that fix and ask GitLab users to upgrade. Doing so would ensure that any new builds are not affected (regardless of whether they run against the old or the new registry). For cache images already in the registry, pulling them would continue to work in the current registry and fail in the new one (or maybe not, see the section below), causing the cache to be rebuilt from scratch.
GitLab.com Migration
Phase 1 - New repositories
If the root cause is not fixed upstream in useful time, or if it's not feasible/desirable to require users to upgrade to a patched version before we start the migration, we can continue accepting the buildkit cache images as-is but process/store them correctly in the database.
For pushes, we can do so by interpreting the buildkit index as a manifest (with one config and a set of layers) and storing it accordingly (as a manifest, not an index) in the database. This requires a custom workaround for buildkit images in the code, which is far from ideal, but it would work.
For pulls, it would work just fine because we store the full index payload on the database, unchanged, and that's what we (blindly) serve on pulls instead of constructing the payload by joining the pieces. So the fact that we break it down in a different way on pushes will have no impact. Clients will continue to get the payload that they pushed, and the digest will match.
This method guarantees that online GC and the API (and, as a side effect, the GitLab UI) will remain fully operational when dealing with these images, without any further changes.
For observability reasons (because these images will otherwise look like any other to anyone inspecting the DB), we should extend the manifests table schema with a non_conformant boolean. We should set it to true to flag that a given manifest was custom processed due to being non-conformant with the Docker/OCI specs. We can then use this column later on to identify, repair (e.g., if a fix is released upstream, we can batch process and convert them to a new/valid form), or delete non-conformant images. This applies to buildkit cache images and any other use cases (hopefully none).
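For reference, here is a rough sketch (hypothetical types, not the actual datastore code) of how a push could decompose a BuildKit cache index into a config plus layers, keep the raw payload untouched for pulls, and flag the record through the proposed non_conformant column:

```go
package main

import "fmt"

const mediaTypeBuildKitCacheConfig = "application/vnd.buildkit.cacheconfig.v0"

type descriptor struct {
	MediaType string
	Digest    string
	Size      int64
}

// manifestRecord stands in for a row in the manifests table. The raw payload
// is kept verbatim so that pulls return exactly what was pushed and the
// digest still matches.
type manifestRecord struct {
	Digest        string
	Payload       []byte
	Configuration *descriptor
	NonConformant bool
}

// storeBuildKitCacheIndex maps a BuildKit cache "index" onto the manifest
// shape: the cache config descriptor becomes the configuration, every other
// reference is treated as a layer, and the record is flagged non-conformant.
func storeBuildKitCacheIndex(digest string, payload []byte, refs []descriptor) (manifestRecord, []descriptor) {
	record := manifestRecord{Digest: digest, Payload: payload, NonConformant: true}
	var layers []descriptor
	for i := range refs {
		if refs[i].MediaType == mediaTypeBuildKitCacheConfig {
			record.Configuration = &refs[i]
			continue
		}
		layers = append(layers, refs[i])
	}
	return record, layers
}

func main() {
	refs := []descriptor{
		{MediaType: "application/vnd.oci.image.layer.v1.tar+gzip", Digest: "sha256:aaaa", Size: 1024},
		{MediaType: mediaTypeBuildKitCacheConfig, Digest: "sha256:bbbb", Size: 512},
	}
	record, layers := storeBuildKitCacheIndex("sha256:cccc", []byte("{}"), refs)
	fmt.Println(record.NonConformant, record.Configuration.MediaType, len(layers))
}
```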
How long we'll have to keep this workaround is uncertain, and we should reevaluate periodically. I don't like the precedent we're creating here, but list references should have been validated from the start, so now we have to control the damage and fix what's already there.
Phase 2 - Migrating existing repositories
Hopefully, the root cause will have been addressed by then and a patched version will be in wide use. If not, then:
- For buildkit cache images: We can either migrate them in the same way that they are processed during pushes, or ignore them if we determine that it's OK to force the rebuild from scratch on the client side;
- For any other invalid images: We can't rule out the existence of other invalid lists/indexes, for which we may or may not (unlikely) be able to apply the same workaround as for buildkit. For that reason, we should halt/undo the migration of any repositories (not individual images, to safeguard repository integrity in case there are cross-references with other images) with non-conforming images as we detect them. As a result, we'll create a backlog that needs to be dealt with later on. Until then, these images will remain accessible through the old code path, so there is no customer impact. We can then analyze each case one by one and devise the best solution.