Failed pre-imports due to invalid manifest references succeed on retry
Context
Found during gitlab#350920 (closed).
Problem
While migrating existing repositories, I noticed one repository pre-import failed due to #651 (closed) (expected), but then the subsequent retry succeeded (unexpected). The underlying (invalid) image that halted the first pre-import hasn't changed, and therefore the retry should have failed.
The problem identified is related to manifest lists, which represent less than 2% of the pushed/pulled images from the .com registry (source). For context, when encountering a manifest list, we do the following:
- Check if the manifest list is already on the metadata DB. If so, skip (problem here); otherwise, proceed to the next step;
- Create manifest list record on DB;
- Pre-import each of the manifests referenced in the list.
The problem is that currently, we always skip the pre-import of a manifest (list) if it's already present on the metadata DB. In this case, the manifest list record will already exist on the DB after the first attempt, so we don't try to re-import its (potentially broken) references. Consequently, a broken manifest list will lead to a pre-import failure on the first attempt but will not stop a retry from succeeding.
Although we haven't faced this so far, importing atomic manifest with invalid config/layer references can also be a problem. Those would succeed on retry for the same reason.
Finally, this is only a concern for pre-imports, not for final imports. That's because final imports are wrapped in a database transaction, which is rolled back in case of failures, so every final import starts from scratch.
Solution
Always fully pre-import manifest references, even on retries. The database queries and storage operations used during a pre-import are idempotent, so there is no problem. However, there is no need to re-import the same manifest multiple times per pre-import. For example, if the same manifest is tagged 100 times, we should only fully import it once, not 100 times. For this reason we should keep a list of imported manifests per pre-import and only import each once.
Data Repair
As for the imports that (wrongly) succeeded on the second attempt due to this bug, looking since the beginning of the migration, we only see 11 occurrences of this error (logs). Looking at these on the backend, I can see that they are all related to buildkit cache images due to #651 (closed).
Additionally, while these repositories were marked as successfully migrated when one of their images was broken, these images are already broken on the old registry. Trying to pull them leads to a not found error there. Therefore, I see no need to do a data repair.
Regardless, for future reference, and as log retention is limited to 7 days, I have collected the list of concerned repositories, tags, and manifest digests so that we can refer to them later if needed (internal).