Resume Import Enumerations: Step One

Context

Repository Enumerations start at the beginning and continue to the end, with no option to skip any repositories along the way.

In the feedback issue, a user with a ~500TiB registry had to restart step one from the beginning after it had already run for more than 11 days: gitlab#423459 (comment 1652886708)

Problems

While a registry of this size likely received many new writes during the 11-day period, a restarted importer will mostly be duplicating effort until it reaches repositories that come after the last imported repository.

Solution

Read the most recently inserted repository in the database and resume the import starting with that repository.

Discussion

We should probably enable this by default, but we have to be mindful of data going stale. Perhaps resuming could be on by default when the last imported repository was created within the past 48 hours, with options to force a start from the beginning or to allow a resume after a longer gap. All of this could be achieved with a freshness flag that accepts a time duration, where 0 forces a fresh start. The same flag could then be reused for the other two steps, each with its own default freshness.
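To make the freshness idea concrete, here is a minimal sketch of the decision it implies. The names `resumeAllowed` and `defaultFreshness` are hypothetical and do not exist in the registry code. Treating a negative duration the same as 0 is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// defaultFreshness is the 48-hour window floated in the discussion
// (hypothetical constant; the real default is undecided).
const defaultFreshness = 48 * time.Hour

// resumeAllowed reports whether an import step may resume rather than
// start over. freshness is the value of the proposed flag; age is how
// long ago the last imported repository was created.
func resumeAllowed(freshness, age time.Duration) bool {
	if freshness <= 0 {
		// 0 forces a fresh start; negative values are treated the
		// same way (an assumption of this sketch).
		return false
	}
	return age <= freshness
}

func main() {
	// Pretend the last imported repository was created 36 hours ago.
	age := 36 * time.Hour

	fmt.Println(resumeAllowed(defaultFreshness, age)) // within 48h: resume
	fmt.Println(resumeAllowed(0, age))                // forced fresh start
	fmt.Println(resumeAllowed(24*time.Hour, age))     // too stale for a 24h window
}
```

Because the check takes only two durations, each import step could pass its own default without changing the logic.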

This step is the least risky of the three to implement: it is fundamentally a performance enhancement, and its responsibility is entirely covered by step two.

The Enumeration function does not currently expose a way to skip efficiently to a particular repository; however, the underlying functionality is already implemented: https://gitlab.com/gitlab-org/container-registry/-/blob/aa5b9c0a174d5922e08eec80828a64dfc213b6f1/registry/storage/catalog.go#L68. We'd only need to surface this option to the Enumeration function. The current implementation starts enumeration after the provided string, so we'll need to either manipulate the string so that the resumed run starts on the actual last repository, or pass in the second-to-last imported repository instead.
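The exclusive start behaviour and one possible string manipulation can be illustrated with a small, self-contained sketch. It uses an in-memory sorted slice, not the registry's catalog code, and assumes plain byte-wise lexical ordering; `enumerateAfter` and `resumeMarker` are hypothetical names. Trimming the final byte produces a marker that sorts before the last imported repository, so that repository is enumerated again. Some neighbouring repositories may be enumerated again as well, which is redundant but harmless.

```go
package main

import (
	"fmt"
	"sort"
)

// enumerateAfter returns the repositories that sort strictly after
// marker, mimicking an exclusive "start after" enumeration over a
// sorted catalog.
func enumerateAfter(sorted []string, marker string) []string {
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i] > marker })
	return sorted[i:]
}

// resumeMarker returns a string that sorts before lastImported (a
// proper prefix always does), so an exclusive enumeration started from
// it includes lastImported again.
func resumeMarker(lastImported string) string {
	if lastImported == "" {
		return "" // nothing imported yet: start from the beginning
	}
	return lastImported[:len(lastImported)-1]
}

func main() {
	catalog := []string{"a/app", "a/base", "b/api", "b/web", "c/db"}
	last := "b/api"

	// Passing the last repository directly skips it.
	fmt.Println(enumerateAfter(catalog, last)) // [b/web c/db]

	// Passing the adjusted marker retries the last repository.
	fmt.Println(enumerateAfter(catalog, resumeMarker(last))) // [b/api b/web c/db]
}
```

Passing the second-to-last imported repository would avoid the string manipulation, at the cost of tracking or querying one extra row.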

Edited Dec 01, 2023 by Hayley Swimelar