Gradual migration plan for GitLab.com Container Registry
Context
The Package team has been working on a new version of the Container Registry that relies on a metadata database (DB) to enable online Garbage Collection (GC) and unblock the implementation of several other features.
With the new version now implemented, we've been working on the rollout, starting with GitLab.com (&5523 (closed)). We previously proposed a zero-downtime gradual deployment/migration plan, but concerns were raised about its setup complexity (details).
This issue describes a possible alternative strategy, which does not require an additional bucket, a separate registry instance/cluster, or a new migration service on the registry side. The main tradeoff is that we give up the maximum level of safety and isolation in exchange for a simpler deployment setup and easier maintenance.
Also, although the intention is to apply this to GitLab.com first, this alternative should also be easier to apply to self-managed installs with huge (hundreds of terabytes) registries that cannot afford the few hours of read-only mode required to perform a simpler one-time offline migration.
Phase 0 - Current state
This is the container registry as it is deployed today, where the only dependency is the storage backend (a GCS bucket for GitLab.com). The metadata DB and online GC functionalities are built-in since v3.0.0 (2021-01-20) but remain dormant, behind feature flags (FFs) disabled by default.
Phase 1 - The metadata DB serves new repositories
Prerequisites
- Configuring allow/deny lists (more about that later in gradual rollout). This currently requires a configuration file update, but we may be able to set them on the DB (#355 (closed));
- Enabling metadata DB and online GC FFs. This requires a configuration file update.
Terminology
- New repository: Repositories that did not exist on the registry before phase 1;
- Existing repository: Repositories that existed on the registry before phase 1;
- Old code: Source code that handles all operations by relying exclusively on the storage backend (no metadata DB or online GC);
- New code: Source code that supports the metadata DB and online GC features.
Expected outcome
All new repositories have their metadata stored and served by the metadata DB and benefit from online GC from the start, keeping their size to the bare minimum.
Storage split
The registry uses two "logical partitions" within the same storage bucket to separate data from existing and new repositories. The split is done using a different root prefix. By default, all objects live under the `docker/registry/v2/` root prefix. For new repositories, data is stored under a new prefix: `gitlab/docker/registry/v2/`.
This avoids the need for a separate bucket while still guaranteeing isolation between the two datasets. Doing so enables us to leverage online GC from the start (see this, under Garbage collection, for the reason why this is required) and provides isolation, guaranteeing that an unforeseen bug in the new code can't cause a data integrity issue in the existing repositories' data.
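To make the split concrete, here is a minimal Go sketch of how the same repository path maps to object keys under the two root prefixes. The `repositoryKey` helper and the layout below each prefix are illustrative assumptions; the registry's actual on-storage layout is an internal detail.

```go
package main

import "fmt"

// Root prefixes for the two logical partitions within the same bucket.
const (
	oldRootPrefix = "docker/registry/v2/"        // existing repositories
	newRootPrefix = "gitlab/docker/registry/v2/" // new repositories
)

// repositoryKey builds an illustrative object key for a repository under the
// given root prefix; the real layout below the prefix may differ.
func repositoryKey(rootPrefix, repoPath string) string {
	return rootPrefix + "repositories/" + repoPath + "/"
}

func main() {
	repo := "my-group/my-project"
	fmt.Println(repositoryKey(oldRootPrefix, repo)) // docker/registry/v2/repositories/my-group/my-project/
	fmt.Println(repositoryKey(newRootPrefix, repo)) // gitlab/docker/registry/v2/repositories/my-group/my-project/
}
```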
Gradual rollout
Note: This step is only required for GitLab.com. See self-managed simplifications below for more details.
Based on allow/deny lists, which are a set of regular expressions matched against a repository's path, we can perform a gradual rollout by defining the subset of new repositories that the new code should handle. Besides limiting the impact of early bugs, this also allows us to exclude repositories from paid customers until we're certain that the new code is operating as expected.
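As a rough illustration of how such lists could be evaluated, the sketch below matches a repository path against allow and deny regular expressions. The function name, the precedence of deny over allow, and the example patterns are assumptions for illustration only; the real configuration keys and semantics may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// eligibleForNewCode is a hypothetical allow/deny evaluation: a repository
// path is handled by the new code only if it matches an allow pattern and no
// deny pattern. The precedence rule here is an assumption.
func eligibleForNewCode(repoPath string, allow, deny []*regexp.Regexp) bool {
	for _, re := range deny {
		if re.MatchString(repoPath) {
			return false
		}
	}
	for _, re := range allow {
		if re.MatchString(repoPath) {
			return true
		}
	}
	return false
}

func main() {
	allow := []*regexp.Regexp{regexp.MustCompile(`^gitlab-org/`)}
	deny := []*regexp.Regexp{regexp.MustCompile(`^important-customer/`)}

	fmt.Println(eligibleForNewCode("gitlab-org/gitlab", allow, deny))      // true
	fmt.Println(eligibleForNewCode("important-customer/app", allow, deny)) // false
	fmt.Println(eligibleForNewCode("some-other-group/app", allow, deny))   // false
}
```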
The strategy is described below:
1. GitLab Org: We'll start by testing with some custom-made container repositories under `gitlab-org/` and then allow all of its new container repositories to be handled by the metadata database;
2. General: Following the GitLab Org rollout, we will proceed with a general percentage-based rollout for everyone, excluding major customers (see next);
3. Major customers: To reduce the risk of impacting users of the GitLab.com container registry that have large amounts of data and/or very high activity, we will exclude these from (2) and handle them in this step with a manual gradual rollout, customer by customer.
All these steps should be done with enough time between them, ensuring that:
- We're comfortable about the risk and gain confidence in the system stability;
- We don't process customer repositories until we're confident that any major issues would have been revealed already;
- We can adjust resources if needed as the load increases, and we learn how the new code and the DB behave.
Routing
Each registry instance will detect whether the target repository is an existing or a new one by analyzing incoming API requests. The repository path is part of the URI of all API endpoints, so the registry can parse that portion of the URI to identify the target repository.
With the repository path in hand, the registry does the following:
graph TD
A([Check target repository]) --> B{Exists under old bucket prefix?}
B -- "Yes (existing repository)" --> C([Process with old code])
C -- "Using old prefix" --> GCS[(Storage backend)]
B -- "No (new repository)" --> D{JWT token has migration flag?}
D -- "Yes" --> E{Flag set to `true`?}
E -- "Yes" --> F([Process with new code])
F --> DB[(Metadata DB)]
F -- "Using new prefix" --> GCS
E -- "No" --> C;
D -- "No" --> G{Exists under new bucket prefix? *}
G -- "Yes" --> F
G -- "No" --> C
\* This check is what allows us to pause the migration (if needed to debug a problem) by not adding any more new repositories to the database, while still being able to continue serving requests for those already there.
Note that we check against the storage backend, not the database, to determine whether the repository is old or new. If we did this check against the database, all requests, including the ones targeting old repositories, could be affected by a database outage or performance degradation, as this is the very first step we need to perform for every request. Doing it this way ensures that requests for old repositories are protected from interactions with the database during the period when bugs are most likely to appear, which is Phase 1.
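The decision tree above could be expressed roughly as follows. This is a sketch only: the `storageChecker` interface, the `migrationFlag` parameter (standing in for the JWT migration claim), and the function names are assumptions, not the registry's actual internals.

```go
package registry

import "context"

// Root prefixes for the two logical partitions (see "Storage split" above).
const (
	oldRootPrefix = "docker/registry/v2/"
	newRootPrefix = "gitlab/docker/registry/v2/"
)

// storageChecker is a hypothetical abstraction over the storage backend.
type storageChecker interface {
	// RepositoryExists reports whether any data exists for repoPath under the
	// given root prefix.
	RepositoryExists(ctx context.Context, rootPrefix, repoPath string) (bool, error)
}

type codePath int

const (
	oldCodePath codePath = iota // storage only, old root prefix
	newCodePath                 // metadata DB, new root prefix
)

// routePhase1 mirrors the Phase 1 routing diagram above. migrationFlag is the
// optional migration claim carried by the JWT (nil when absent).
func routePhase1(ctx context.Context, s storageChecker, repoPath string, migrationFlag *bool) (codePath, error) {
	existsOld, err := s.RepositoryExists(ctx, oldRootPrefix, repoPath)
	if err != nil {
		return oldCodePath, err
	}
	if existsOld {
		// Existing repository: always served by the old code in Phase 1.
		return oldCodePath, nil
	}

	if migrationFlag != nil {
		// The token carries an explicit decision (e.g. derived from the
		// allow/deny lists), so we follow it.
		if *migrationFlag {
			return newCodePath, nil
		}
		return oldCodePath, nil
	}

	// No flag: repositories already present under the new prefix keep being
	// served by the new code; this is what allows pausing the migration
	// without breaking repositories that were already handled by it.
	existsNew, err := s.RepositoryExists(ctx, newRootPrefix, repoPath)
	if err != nil {
		return oldCodePath, err
	}
	if existsNew {
		return newCodePath, nil
	}
	return oldCodePath, nil
}
```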
Temporarily mirror metadata writes to new storage prefix
Note: This step is only required for GitLab.com. See self-managed simplifications below for more details.
Besides writing and reading metadata from the database, the new code path writes (but does not read) metadata to the new storage prefix in "parallel" for a limited period. This metadata is mainly composed of tiny files that only contain a SHA256 hash, so their impact on storage space is negligible. This is necessary as a backup safety measure.
Once we're confident enough about the system stability, we can stop mirroring metadata to the storage backend by turning off the corresponding FF (enabled by default when the metadata DB FF is enabled). We can then optionally perform a cleanup on the new prefix to remove all metadata written so far. We could do this by deleting all objects with prefixes `gitlab/docker/registry/v2/repositories/**/(_manifests|_layers)`, using, e.g., `gsutil`.
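A minimal sketch of what the mirrored write could look like, assuming hypothetical `metadataDB` and `objectStore` abstractions and the conventional link-file layout; the registry's actual write path and object layout may differ.

```go
package registry

import "context"

// Hypothetical abstractions for the metadata DB and the storage backend;
// the real registry types differ.
type metadataDB interface {
	LinkLayer(ctx context.Context, repoPath, digest string) error
}

type objectStore interface {
	Put(ctx context.Context, key string, body []byte) error
}

// writeLayerLink records a layer link in the metadata DB and, while the
// mirroring feature flag is enabled, also writes the corresponding link file
// (a tiny object containing only the SHA256 digest) under the new root prefix.
func writeLayerLink(ctx context.Context, db metadataDB, store objectStore, mirror bool, repoPath, digest string) error {
	if err := db.LinkLayer(ctx, repoPath, digest); err != nil {
		return err
	}
	if !mirror {
		return nil
	}
	key := "gitlab/docker/registry/v2/repositories/" + repoPath + "/_layers/sha256/" + digest + "/link"
	return store.Put(ctx, key, []byte("sha256:"+digest))
}
```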
Rollback and repeat
As a last resort measure, if we ever find ourselves in a situation where we can't continue to serve requests through the new code path due to a major unexpected issue, we can transfer the data (blobs and metadata) of new repositories from the new bucket prefix to the old one (using, e.g., `gsutil rsync`). We could then disable the metadata DB FF and serve content through the old code path exclusively.
Once the issue had been debugged and fixed, we could restart Phase 1.
Self-managed simplifications
We believe the gradual rollout and the temporary metadata write mirroring are only a requirement for GitLab.com. We're starting with the GitLab.com registry to dogfood the new code and the migration process. These steps exist to provide a safety net and guarantee that we can control and limit the impact of any early major bugs.
By the time we're done with the GitLab.com registry migration, the code should be stable and the overall process refined, so it should be safe to exclude these steps for self-managed installs. We can still approach the rollout iteratively by selecting a few known customers willing to give it a try (if they can't tolerate a few hours of read-only mode required for a simpler one-time offline import).
Phase 2 - Migrate existing repositories
Once the new registry deployment has proven to be stable while serving new repositories and performing online GC, it's time to migrate existing repositories gradually. This implies importing their metadata from the old storage prefix into the metadata database. It also requires copying the layer blobs used by each repository from the old prefix to the new one (see this, under Garbage collection, for the reason why this is required).
Prerequisites
- Phase 1 completed;
- Storage backend metadata mirroring FF disabled. This requires a configuration file update.
Expected outcome
The metadata of all existing repositories (before Phase 1) has been imported into the metadata DB, and their blobs are now located under the new storage backend prefix.
Scope
Repositories
The migration will only target repositories known to Rails, i.e., those registered in the `container_repositories` table on the Rails DB. Rails registers a repository in this table whenever a client obtains a JWT token for communicating with the registry. That token grants permission to read from and write to a specific repository. You can read the documentation for a detailed description of the authentication and push/pull flow.
Considering this, we can use the list of container repositories registered on the Rails side as the inventory of existing repositories that should be migrated. Repositories not listed in this table will be ignored and left behind in the old bucket prefix.
It's important to note that not all repositories on the registry storage backend may be registered on the Rails side, and vice versa. Some may have been created (and never used again) before the registration logic was added to Rails several years ago, and there might be inconsistencies due to past bugs. This inventory is nevertheless the best we can do. The data of these repositories (very old and not accessed in the last several years), if any, can be retained for a given amount of time (that is for Phase 3).
Repository data
For each repository, the migration will only target tagged images. This means that untagged manifests and the layers referenced exclusively by them will be left behind. Repositories will therefore be in a clean state once migrated and kept that way by online GC.
This should drastically reduce the amount of metadata to be imported into the DB and the number of layers transferred to the new storage prefix, especially as tag cleanup policies come into effect for most repositories before this phase. Edit: This feature is now available for all projects on GitLab.com.
For GitLab.com, based on our experience with other (smaller and thus easier to fully inspect) registries, such as `dev.gitlab.org`, we expect that only 60-70% of images are tagged. Edit: We now have confirmation of this, as during Phase 1 we're observing that ~35% of newly created images are untagged and then garbage collected (source).
Migration
The migration should be actioned from the Rails side, leveraging Sidekiq workers and the existing tooling (FFs, chatops, monitoring, etc.) to loop over the `container_repositories` table and invoke the registry API to start a migration and poll for its state.
Leveraging Rails to control the migration also allows us to prioritize the migration of repositories per namespace and tier for GitLab.com, such as starting with free namespaces and only then moving up the list, and prioritizing or excluding specific top-level namespaces from the initial migration phase, similar to what was described for the gradual rollout in Phase 1. Edit: We decided to move forward with a phased rollout like this, starting with some `gitlab-org` repositories first (for testing purposes), then all repositories on the free tier, then all paid tiers, and finally a list of VIP paid namespaces.
Additionally, we can also surface an alert in the Rails UI that a given repository is scheduled for or in the process of being migrated so that namespace admins are aware of it, and maybe even give them the possibility to customize the scheduling if their repositories are large enough that migration is expected to require more than just a few minutes of read-only mode (more on that later).
Finally, we can also record the migration status of each repository on the Rails side and use that information to dynamically enable new features on the UI/API that are only available for migrated repositories/namespaces.
Two-pass approach
While a repository is being migrated, it must remain in read-only mode to prevent write operations (which may alter the repository) and guarantee consistency. Currently, the registry only supports a global read-only mode, which applies to all repositories. Still, we could lock them individually as required for this strategy by leveraging the metadata DB, or by doing it from the Rails side, refusing to serve JWT tokens with push permissions.
The time required to import a repository largely depends on the number of tags and the number of referenced layer blobs. The higher these numbers, the longer it will take to complete. To minimize the time that a repository must remain in read-only mode, we have made it possible to perform a two-pass import:
- First pass: Import all referenced repository artifacts (i.e., manifests, blobs) but not tags. This does not require the repository to be in read-only mode.
- Second pass: Import all referenced repository artifacts created since the first pass and import tags. This requires the repository to be in read-only mode and must be run as soon as possible after the first pass to reduce the probability of new artifacts being created between the two.
The vast majority of the time required for an import comes from inspecting image manifests, determining which layers they reference, and then recording their metadata in the database and transferring layers to the new prefix (first pass). By decoupling these two steps, we can reduce the required read-only period to the bare minimum (second pass).
Any assets created between the two passes will be picked up during the second pass. If any manifests or blobs are deleted between the two passes, the corresponding tags are deleted with them, so they won't be picked up in the second pass. All manifests and blobs imported during the first pass will be automatically scheduled for review by the online GC (which, by default, happens ~24h later). Therefore, if any of them remain unreferenced/untagged after the second pass, they will be garbage collected.
The effectiveness of the two-pass import was discussed and evaluated in #324 (closed). We can expect the second pass to take around 5 seconds per 1k tags based on our experiments. This is the time a repository would have to remain in read-only mode.
Taking GitLab.com as an example, given the low write rate of the registry (~20 req/s at the time of writing) and the large number of known repositories (+1M), the probability of a conflicting write request for a given repository happening during its second pass should be negligible. Nevertheless, a second pass should have a hard limit on its duration to make sure it does not lock a repository for more than a predefined amount of time, after which the migration should be canceled and rescheduled. Edit: We settled on a maximum duration of 30 minutes, after which Rails will trigger an import cancelation request to force it to stop.
Even with a fast second pass, repositories with a huge number of tags (+50k) probably won't fit within a safe boundary. These should be rare, but we may need to encourage the admins of such namespaces to enable cleanup policies (reducing the number of tags) and/or arrange for a more prolonged read-only period. We should identify a safe limit for how big a repository can be (in tag count) to determine whether we should attempt to migrate it straight away or not. Edit: We decided to introduce a Rails application setting for the maximum number of tags that a repository can have to be considered eligible for migration. This is set to 100 by default. We will gradually increase this limit while measuring performance.
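The sketch below outlines the two-pass flow from the point of view of whatever drives a single repository's migration, assuming hypothetical `importer` and `repoLocker` interfaces. The 30-minute bound mirrors the cancelation window mentioned above; everything else (names, locking mechanics) is illustrative.

```go
package registry

import (
	"context"
	"time"
)

// importer is a hypothetical interface; the real registry exposes pre-import
// and import through its API and internal import tooling.
type importer interface {
	// PreImport copies manifests and blobs (but not tags); the repository
	// remains writable during this pass.
	PreImport(ctx context.Context, repoPath string) error
	// Import copies anything created since the pre-import plus all tags; the
	// repository must be read-only while it runs.
	Import(ctx context.Context, repoPath string) error
}

type repoLocker interface {
	SetReadOnly(ctx context.Context, repoPath string, readOnly bool) error
}

// migrateRepository sketches the two-pass approach: a long first pass with
// writes still allowed, then a short, time-bounded second pass under a
// read-only lock.
func migrateRepository(ctx context.Context, imp importer, lock repoLocker, repoPath string) error {
	if err := imp.PreImport(ctx, repoPath); err != nil {
		return err
	}

	if err := lock.SetReadOnly(ctx, repoPath, true); err != nil {
		return err
	}
	defer lock.SetReadOnly(ctx, repoPath, false)

	// Bound the read-only period; an import that exceeds it is canceled and
	// can be rescheduled later.
	importCtx, cancel := context.WithTimeout(ctx, 30*time.Minute)
	defer cancel()
	return imp.Import(importCtx, repoPath)
}
```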
Duration estimates
We're preparing to inventory the GitLab.com registry to determine the average repository size (tag count) and a per-namespace repository distribution (gitlab-com/gl-infra/delivery#1674 (closed)) that may let us predict the duration of this strategy.
This will also allow us to identify repositories that may be above a safe threshold in terms of tag count (e.g., +50k) and reach out to the corresponding namespace admins ahead of time.
Edit: We have performed the aforementioned inventory and shared it here (internal). It was then used to guide the decision making around limits and timing expectations.
Rails internals
Note that this is not an exhaustive or detailed description; it's an overview of how Rails will drive the migration. A more detailed description and implementation plan will be provided elsewhere later on. Edit: This was fully detailed in &7316 (comment 792633854).
- To start with, a single worker would process one repository at a time. We can leverage concurrency to speed up the migration once the process proves to be stable and reliable;
- The worker would manage a state machine for the migration status, using new column(s) on the `container_repositories` table;
- For each repository, the Rails worker would first trigger the pre-import, followed by the final import;
- To perform the read-only locking on the Rails side, we adapt the Rails auth API (`/jwt/auth`) used for the registry so that it considers the migration status of each repository. If a repository is in the process of being migrated, the API refuses to serve JWT tokens with the `push` scope. There is a downside here, though: JWT tokens have a validity of 15 minutes, so one may be able to alter data after the read-only lock comes into effect. To account for this, we also lock writes on the registry side during a final import;
- We limit the migration duration to guarantee that the read-only mode does not disrupt the user experience. Edit: Rails will automatically cancel an import if it does not complete after 30 minutes.
Registry internals
- Expose a new API endpoint to start the migration of a given repository (with authz). This includes separate routes for pre-import and import, which the Rails workers need to invoke. Edit: This is documented here;
- To avoid using a separate service, we spawn a background goroutine (worker from now on) within a registry instance whenever a migration request is received from Rails. This worker handles either a pre-import or an import request, using the existing import tool internally as a library;
- We limit the number of workers per instance to avoid resource starvation. For example, we start with up to 1 active worker per instance and increase the limit as/if we gain confidence and see headroom. The (external) load balancing across cluster instances provides a proper distribution of requests. Edit: We settled on a default concurrency of `1`, but this is configurable. See the docs for all available configuration options. To avoid exceeding the limit of workers per instance, each instance responds with `429 Too Many Requests` when all slots are busy, letting the Rails worker know that it must retry the request until it finds an instance with available slots (see the sketch after this list);
- Once a (pre-)import completes with success or error, the registry notifies Rails about it. This is achieved using an async notification sent to the Rails API. The completion of a pre-import acts as the trigger for the final import. Edit: These notifications are documented here;
- To account for possible lost notifications, Rails will poll for the status of a migration if it doesn't hear from the registry within 10 minutes of the start of a (pre-)import. Edit: The corresponding registry API endpoint is documented here;
- To account for possible stalled (pre-)imports, the registry accepts a cancelation request from Rails, which is fired (by default) 30 minutes after the start of the (pre-)import. Edit: The corresponding registry API endpoint is documented here.
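The following sketch shows one way the per-instance worker limit and the `429 Too Many Requests` behavior could fit together, using a buffered channel as a semaphore and a background goroutine per accepted request. The handler shape, the query parameter, and the notification callback are assumptions; the real endpoint and payloads are described in the linked docs.

```go
package registry

import (
	"context"
	"net/http"
)

// importWorkers caps the number of concurrent (pre-)import goroutines on a
// single registry instance. A buffered channel acts as a semaphore; a
// capacity of 1 would match the default concurrency mentioned above.
type importWorkers struct {
	slots   chan struct{}
	runPass func(ctx context.Context, repoPath string) error // the actual (pre-)import work
	notify  func(repoPath string, err error)                 // async completion callback to Rails
}

func newImportWorkers(maxConcurrency int, runPass func(context.Context, string) error, notify func(string, error)) *importWorkers {
	return &importWorkers{slots: make(chan struct{}, maxConcurrency), runPass: runPass, notify: notify}
}

// ServeHTTP sketches the import endpoint: if no slot is free the instance
// answers 429 so the Rails worker retries against another instance; otherwise
// the work runs in a background goroutine and Rails is notified on completion.
func (w *importWorkers) ServeHTTP(rw http.ResponseWriter, r *http.Request) {
	repoPath := r.URL.Query().Get("repository") // illustrative; the real route encodes the path differently

	select {
	case w.slots <- struct{}{}:
	default:
		rw.WriteHeader(http.StatusTooManyRequests)
		return
	}

	go func() {
		defer func() { <-w.slots }()
		err := w.runPass(context.Background(), repoPath)
		w.notify(repoPath, err)
	}()

	rw.WriteHeader(http.StatusAccepted)
}
```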
Observability
We should produce metrics and dashboards to allow monitoring of the overall migration progress. Some valuable metrics would be:
- Migration rate (how many repositories were migrated across different timeframes);
- Migration duration;
- Percentage of migrated repositories vs. the ones left (forecast);
- Percentage of canceled/rescheduled migrations.
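For illustration, these could be exposed as Prometheus metrics from the components driving the migration. The metric names, labels, and buckets below are hypothetical, not the actual metrics shipped with the registry or Rails.

```go
package registry

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical migration metrics; names and labels are illustrative only.
var (
	migrationsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "registry_migration_repositories_total",
		Help: "Count of repository migrations by final status.",
	}, []string{"status"}) // e.g. "migrated", "canceled", "failed"

	migrationDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "registry_migration_duration_seconds",
		Help:    "Time taken to migrate a single repository.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s .. ~1h
	})
)

func init() {
	prometheus.MustRegister(migrationsTotal, migrationDuration)
}
```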
Migrated repositories
Once a repository is migrated, all requests that target it would automatically follow the new code path as described in the following section.
Routing
During this phase, all requests targeting new or migrated repositories will be routed through the new code path without exceptions. However, we need to continue to serve requests through the old path for repositories that already exist but have not yet been migrated, as described in this section.
Because of this, we need to adapt (and simplify) the routing logic implemented for Phase 1 to take into account whether an existing repository has already been migrated or not:
graph TD
A([Check target repository]) --> B{Exists AND is marked as migrated OR native in DB?};
B -- "Yes" --> C([Process with new code]);
B -- "No" --> E([Present on Old Prefix?]);
C --> DB[(Metadata DB)];
C -- "Using new prefix" --> GCS[(Storage backend)];
E -- "Yes" --> F([Process with Old Code])
E -- "No" --> C
F -- "Using Old prefix" --> GCS;
Note that we no longer need to evaluate allow/deny lists, as those are exclusive to Phase 1. Also, unlike with the Phase 1 routing, we perform the existence check against the database and not the storage backend. If a repository is new, it will need the database to be processed. If a repository is not new, we need to check if it was already migrated or not, and for that, we need to look at the database. Therefore, regardless of the repository status, we will always need to check the database before serving a request, so there is no reason to keep the existence check against the storage backend.
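A sketch of the simplified Phase 2 decision, with hypothetical lookup functions standing in for the metadata DB and storage backend checks:

```go
package registry

import "context"

// phase2Deps bundles hypothetical lookups; names are illustrative.
type phase2Deps struct {
	// inDatabase reports whether the repository is present in the metadata DB,
	// i.e. native to the new code or already migrated.
	inDatabase func(ctx context.Context, repoPath string) (bool, error)
	// onOldPrefix reports whether any data exists for the repository under the
	// old root prefix in the storage backend.
	onOldPrefix func(ctx context.Context, repoPath string) (bool, error)
}

// routePhase2 sketches the simplified Phase 2 decision: the metadata DB is
// consulted first for every request; only repositories absent from the DB but
// present under the old prefix keep using the old code until they are migrated.
func routePhase2(ctx context.Context, d phase2Deps, repoPath string) (useNewCode bool, err error) {
	known, err := d.inDatabase(ctx, repoPath)
	if err != nil || known {
		return known, err
	}
	existsOld, err := d.onOldPrefix(ctx, repoPath)
	if err != nil {
		return false, err
	}
	// Not in the DB and not on the old prefix: a brand new repository, handled
	// by the new code from the start.
	return !existsOld, nil
}
```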
Phase 3 - Cleanup
Prerequisites
- Phase 2 completed.
- The data retention period has passed.
Steps
Once all repositories are migrated, we can finally remove all objects in the old storage prefix. This could be done after retaining the data for a given amount of time.
It's not yet clear how we could automate the triggering of this step. Perhaps we could have Rails action a cleanup through the registry API once all known repositories have been migrated and the retention period has passed. However, this is potentially dangerous, so a manual activation (maybe from the Rails side, with a rake task, for example) may be a safer bet.
As noted in #374 (comment 582259976), for GitLab.com, we should do this in small steps to not cause performance issues, as it will cause GCS to schedule the deletion of a huge amount of non-current objects.
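If we did end up automating (part of) this step, the cleanup could look roughly like the sketch below, which deletes objects under the old root prefix using the GCS Go client and logs progress in batches. The bucket name and batch size are assumptions; in practice this would only run after the retention period, likely behind a manual trigger, and throttled to avoid the performance issues mentioned above.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Hypothetical bucket name; the old root prefix is the one being retired.
	bucket := client.Bucket("example-registry-bucket")
	it := bucket.Objects(ctx, &storage.Query{Prefix: "docker/registry/v2/"})

	const batchSize = 1000
	deleted := 0
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if err := bucket.Object(attrs.Name).Delete(ctx); err != nil {
			log.Printf("failed to delete %s: %v", attrs.Name, err)
			continue
		}
		deleted++
		if deleted%batchSize == 0 {
			// A real implementation would pause or checkpoint here to avoid
			// overloading GCS; this sketch only logs progress.
			fmt.Printf("deleted %d objects so far\n", deleted)
		}
	}
	fmt.Printf("done, deleted %d objects\n", deleted)
}
```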