Gradual migration proposal for the GitLab.com container registry
Note: This proposal has been superseded by #374 (closed).
Context
As part of &2313 (closed), we need to develop plans and tools to enable and support the online migration of large registries to a new instance/cluster backed by a metadata database.
Existing registries store the repositories' metadata in the object storage backend (along with the deduplicated layer blobs). Once the metadata database is in place, it's necessary to import the existing repositories' metadata from the storage backend into the database, online, without stopping the registry or setting it to read-only mode for a prolonged period.
This issue describes a strategy in which large registries may be migrated with zero downtime. Although the strategy is generic enough and can be adapted and likely automated for registries in self-managed instances later on, the immediate use case is GitLab.com.
💡 Migration Strategy
🗄 Phase 0 — Current State
This is the container registry as it is deployed today: a single cluster accessible at `registry.gitlab.com`, with a single object storage bucket (GCS) to which metadata and blobs are written and from which they are read.
Beyond access to the registry HTTP API, clients (GitLab application, Docker CLI, or others) also gain direct access to the GCS bucket for blob upload and download requests using signed URLs provided by the registry (storage redirect). The current GCS bucket name is `gitlab-<env>-registry`.
👶 Phase 1 — New Repositories Use the Database-Backed Registry
This phase requires deploying a new cluster of container registry instances backed by the metadata database, side by side with the existing cluster. The new registry exclusively serves requests targeting new repositories. Requests targeting existing repositories continue to be served by the existing registry.
To achieve this, the existing registry remains the single entry point for all requests, but proxies requests for new repositories to the new registry, which has the metadata database and a new storage bucket, named `gitlab-<env>-container-registry`.
The request handling process is transparent for clients, as they have no visibility of the new registry. All requests continue to flow through `registry.gitlab.com`. The exception is that they have visibility over the new storage bucket due to the signed URLs for blob upload/download requests, which include the bucket name (e.g. `https://storage.googleapis.com/gitlab-gprd-container-registry/...`).
The registry database will be hosted on a new and fully dedicated PostgreSQL 12 cluster (gitlab-com/gl-infra&315).
Proxy
No external reverse proxy is required for this solution. Proxying is handled internally by the registry instances on the main cluster, listening at `registry.gitlab.com`. All API requests (except repository listing, which we don't use) have the target repository name as a URL path prefix (`/v2/<name>/...`). This enables the main registry instances to determine whether the target repository already exists (by looking at the current blob storage bucket) and fulfill or proxy the request accordingly. The execution flow is detailed in the diagram below:
```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant R1 as Registry
    participant B1 as GCS Bucket
    participant R2 as Registry 2
    participant B2 as GCS Bucket 2
    participant DB as Database
    C->>R1: HTTP /v2/<name>/...
    R1->>B1: The repository<br>with <name> exists?
    B1->>R1: Yes/No
    alt Yes
        %%rect rgb(229, 255, 204)
        R1->>B1: Fulfil request
        R1->>C: Response
        %%end
    else No
        R1->>R1: Does <name> match against the include list<br>AND<br>not against the exclude list?
        alt Yes
            R1->>R2: Proxy request
            R2->>DB: Fulfil request
            R2-->>B2: Push/pull blobs<br>(if applicable)
            R2->>R1: Response
            R1->>C: Proxy response
        else No
            R1->>B1: Fulfil request
            R1->>C: Response
        end
    end
```
This feature was implemented in #218 (closed) and later augmented with support for include/exclude lists of regular expressions in #250 (closed). A new repository is only proxied to the private registry if its name matches the include list and does not match the exclude list. The include/exclude lists are part of the registry configuration file and can be adjusted whenever necessary.
Please see the documentation for additional details.
The proxy mode should then be enabled for (and only for) the main registry instances listening at `registry.gitlab.com`.
Gradual rollout
Based on the proxy include/exclude lists, we can perform a gradual and controlled rollout by including/excluding subsets of new repositories that should be proxied to the new registry.
A sample strategy is described below:
- Start by excluding all new repositories except those under a random `gitlab-org` subgroup that we can create for testing (e.g. `gitlab-org/registry-testing/.*`, see the sketch after this list). This guarantees that we can test the GitLab Rails <-> Registry integration without any risk of impacting customers or any of our real repositories;
- Expand the include list to match all new repositories for non-critical `gitlab-org` subgroups;
- Expand the include list to match all new repositories under `gitlab-org`;
- Expand the include list to match new repositories for a selected number of low-risk customer namespaces;
- Expand the exclude list to match VIP customers' namespaces for whom the container registry is a critical dependency. We can opt to leave them out until a given point in time to reduce the risk even more;
- Expand the include list to match all new repositories whose path starts with `z`, then `y`, `x`, etc. (all alphanumeric characters).
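As a hedged illustration of the first step, and assuming the include/exclude lists accept Go regular expressions (the exact configuration syntax may differ), the lists could start out roughly like this:

```go
package proxy

import "regexp"

// Hypothetical lists for the first rollout step: only the testing subgroup is
// routed to the new registry; everything else keeps hitting the existing one.
var (
	includeList = []*regexp.Regexp{
		regexp.MustCompile(`^gitlab-org/registry-testing/.*`),
	}
	excludeList []*regexp.Regexp // empty for now: nothing explicitly excluded
)
```

Later steps would simply broaden the include list (e.g. to all of `gitlab-org`) and add VIP customer namespaces to the exclude list.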
All these steps should be done with enough time between them to ensure that:
- We all feel comfortable about the risk;
- We gain confidence in the system's stability;
- We do not start touching the customer repositories until enough time (and load) has passed to be confident that any major issues would have been revealed already. If anything major is spotted on the way, we should be able to revert and sync any new data into the private registry in a short period of time;
- We can adjust the private registry resources as the load increases and we learn how it behaves in a realistic environment.
Temporarily mirror metadata write requests to the new bucket
Besides writing and reading metadata from the database, the new instances should continue to write (not read) metadata to the new blob storage bucket in "parallel" for a limited period. This metadata is mainly composed of files that only contain a SHA256 hash, so they are tiny and their impact on storage usage is minimal.
This is necessary as a backup safety measure. If we ever find ourselves in a situation where we can't continue to serve requests through the new instances (due to a major unexpected issue with the metadata database), as a last resort we can transfer all data (blobs and metadata) of new repositories from the new bucket to the old one (using e.g. `gsutil rsync`). We can then disable the proxy mode and serve all requests through the old instances only.
Once we're confident enough about the new instances' stability and the database in production, we can stop mirroring metadata to the storage backend by turning off the corresponding configuration flag. We can then perform a cleanup on the new bucket to remove all metadata written since the bucket was created. We can do it easily by deleting all objects with prefixes `docker/registry/v2/repositories/**/_manifests` and `docker/registry/v2/repositories/**/_layers`.
♻ Phase 2 — Existing Repositories Migrate to the Database-Backed Registry
Note: Phase 2 can only begin when Phase 1 completes, i.e., all new repositories are being proxied to the new registry.
Once the new registry deployment has proven to be stable while serving new repositories and performing online garbage collection, it's time to gradually move existing repositories from the old registry to the new one. This implies importing their metadata from the old storage bucket into the metadata database. Additionally, it also requires copying the (deduplicated) layer blobs used by each repository from the old bucket to the new one.
An important aspect is that this migration only targets tagged images. This means that untagged manifests and the layers referenced exclusively by them will be left behind. Migrated repositories will therefore start in a clean state on the new registry. This should drastically reduce the amount of storage to be transferred, especially if the tag cleanup policies come into effect for most repositories before this phase, which would be ideal.
Inventory
Before migrating existing repositories, we must first determine how many exist and which namespaces they belong to. Once Phase 1 completes, no new repositories will be created in the "old" bucket anymore, which means we can safely scan it to enumerate the existing repositories.
This exercise's output will be a list of repositories (their path/namespace, e.g. `gitlab-org/gitlab`) sorted in lexicographic order. Once we have this list, we can correlate it with the GitLab database to augment it with additional data, such as the corresponding namespace tier.
Finally, this list should be loaded into the metadata database in a special temporary table, where we'll record the repository path, the migration priority/order, and the import start and completion timestamps.
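For illustration only, a row of that table could be represented along these lines (field names are hypothetical; the actual schema may differ):

```go
package inventory

import "time"

// Repository is a hypothetical view of one row of the temporary inventory
// table used to drive and track the migration of existing repositories.
type Repository struct {
	Path              string     // e.g. "gitlab-org/gitlab"
	NamespaceTier     string     // augmented from the GitLab database
	Priority          int        // migration priority/order
	ImportStartedAt   *time.Time // nil until the import starts
	ImportCompletedAt *time.Time // nil until the import completes
}
```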
For self-managed instances, if any require an online migration approach (as opposed to a one-off offline migration), it's likely that we can make this an automated process (creating the inventory and loading it into the database). For GitLab.com, due to the registry size and given that we want to apply custom filters and priorities based on the users' namespace, this will need to be a manual step.
Data migration
Once we have the repositories list, we need to iterate over it and migrate repositories from the old registry into the new one. We should start by cherry-picking the `gitlab-org/**` repositories, and only then continue with the automated migration of the remaining ones. Namespaces of VIP customers can be left for the end of the migration to reduce risk.
The migration should be carried out by a separate service (see #319 (closed)), independent of any registry instance (old or new). This service should have access to both old and new storage buckets and to the new metadata database, where the list of repositories to migrate and their migration status are recorded.
We have already developed an import tool that can scan repositories in a storage bucket and import their metadata into a database, as well as optionally copy blobs from an old bucket to a new one. What's left to implement is the ability to execute this at a configurable cadence, on a per-repository basis, while keeping the migration status updated in the database. This will be the responsibility of the migration service, which will embed the import tool.
Two-pass import
While a repository is being migrated from the old registry to the new one, it must be in read-only mode to prevent write operations (which may create or alter the repository) and guarantee consistency. Currently, the registry only supports a global read-only mode, which applies to all repositories, but we can lock them individually as required for this strategy.
The time required to import a repository largely depends on the number of tags and the number of referenced layer blobs. The higher these numbers, the longer it will take to complete. To minimize the time that a repository must remain in read-only mode, we have made it possible to perform a two-pass import:
- First pass: Import all tagged repository assets (i.e., manifests, blobs) but not tags. This does not require the repository to be in read-only mode.
- Second pass: Import all tagged repository assets and tags. This requires the repository to be in read-only mode and must run immediately after the first pass to reduce the probability of new assets being created between the two.
The vast majority of the time required for an import comes from inspecting the tagged image manifests, determining which blobs they reference, and then recording their metadata in the database and transferring blobs to the new bucket (first pass). By decoupling these two steps, we can reduce the required read-only period to the bare minimum (second pass).
Any assets created between the two passes will be picked up by the import tool during the second pass. If any manifests or blobs are deleted between the two passes, so are the corresponding tags. If a tag is deleted between the two passes, it won't be picked up in the second pass. All manifests and blobs imported during the first pass will be automatically scheduled for review by the online garbage collector. Therefore, if any of them remain unreferenced/untagged after the second pass, they will be garbage collected.
The effectiveness of the two-pass import was discussed and evaluated in #324 (closed). Based on our experiments, unlike the first pass, which requires several minutes, we can expect the second pass to take around 5 seconds per 1k tags. This is the time a repository would have to remain in read-only mode.
Given the low write rate of the GitLab.com registry (~21 req/s at the time of writing), the probability of a conflicting write request for a given repository happening during the second pass of that same repository is negligible. Nevertheless, a second pass should have a hard limit to make sure it does not lock a repository for more than a predefined amount of time (TBD), after which the import would be canceled.
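A minimal sketch of how the migration service could orchestrate the two passes, assuming a hypothetical `Importer` abstraction over the import tool and an illustrative hard limit for the read-only window (none of these names come from the actual import tool):

```go
package migration

import (
	"context"
	"time"
)

// Importer abstracts the import tool; method names are hypothetical.
type Importer interface {
	ImportTaggedAssets(ctx context.Context, repo string) error       // first pass
	ImportTags(ctx context.Context, repo string) error               // second pass
	LockRepository(ctx context.Context, repo string) (func(), error) // per-repository read-only lock
}

// secondPassTimeout is an illustrative hard limit (the real value is TBD).
const secondPassTimeout = 30 * time.Second

// MigrateRepository runs the two-pass import for a single repository.
func MigrateRepository(ctx context.Context, imp Importer, repo string) error {
	// First pass: import manifests and blobs while the repository stays writable.
	if err := imp.ImportTaggedAssets(ctx, repo); err != nil {
		return err
	}

	// Second pass: lock the repository and import tags (plus any assets created
	// between the two passes), bounded by the hard limit.
	unlock, err := imp.LockRepository(ctx, repo)
	if err != nil {
		return err
	}
	defer unlock()

	ctx, cancel := context.WithTimeout(ctx, secondPassTimeout)
	defer cancel()
	return imp.ImportTags(ctx, repo)
}
```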
Even with a fast second pass, repositories with a huge number of tags (e.g., +50k) will probably not fit within a safe boundary. These should be rare, but we may need to postpone their import to a later date and arrange for a more prolonged read-only period with the customers that own the corresponding namespaces. The intention is to define a safe limit for the number of tags of a given repository to determine whether we should import it or skip it until a later date.
We're currently evaluating doing an inventory to determine the average repository size (number of tags) and a per-namespace repository distribution that would let us predict the results of this strategy and possibly reduce the scope of the migration (https://gitlab.com/gitlab-org/container-registry/-/issues/320 - internal).
Proxy requests for migrated repositories
As described in the previous section, a repository must be flagged as "migrated" in the state store once it's successfully migrated to the new registry. The proxying logic described previously must therefore be updated to take the "migrated" flag into account when deciding whether a request should be fulfilled or proxied to the new registry. If a repository exists in the old registry, we must then check whether it was already migrated. If it was, the request should be proxied to the new registry; otherwise, it should be fulfilled locally.
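Extending the earlier routing sketch, the Phase 2 decision could look roughly like this (hypothetical names; the `migrated` flag would come from the state store, and the include/exclude matching for brand-new repositories is omitted here):

```go
package proxy

// Target indicates where a request should be handled.
type Target int

const (
	FulfilLocally Target = iota
	ProxyToNewRegistry
)

// routeRequest sketches the Phase 2 decision: repositories known to the old
// registry are proxied only once flagged as migrated; otherwise they keep
// being fulfilled locally so the original metadata remains usable as a backup.
func routeRequest(existsInOldRegistry, migrated bool) Target {
	if !existsInOldRegistry {
		// New repository: Phase 1 include/exclude rules apply (omitted here).
		return ProxyToNewRegistry
	}
	if migrated {
		return ProxyToNewRegistry
	}
	return FulfilLocally
}
```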
As an alternative, we could simply delete each repository's metadata from the main registry once the migration completes. By doing so, when the subsequent request targeting that repository arrives at the main registry, its metadata would no longer be in the storage backend. Therefore it would be considered as new/unknown, and the request proxied to the new registry. However, by doing so, we would lose the ability to use the "original" metadata as a backup (in case it's needed to debug a migration issue or for a rollback).
Observability
We should produce metrics and dashboards to allow monitoring of the overall migration progress. Some valuable metrics would be:
- Migration rate (how many repositories were migrated across different timeframes);
- Percentage of migrated repositories vs the ones left (forecast);
- Percentage of invalidated imports (due to their size exceeding the predefined time boundaries);
- Minimum, median, and maximum run time per migration.
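As a sketch of what the migration service could expose to build these dashboards, assuming Prometheus instrumentation (metric names, labels, and buckets are illustrative):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metrics the migration service could expose.
var (
	// Total repository migrations, labelled by outcome (e.g. success, canceled, failed).
	migratedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "registry_migration_repositories_total",
		Help: "Number of repository migrations, by outcome.",
	}, []string{"outcome"})

	// Repositories still waiting to be migrated, useful for forecasting completion.
	remaining = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "registry_migration_repositories_remaining",
		Help: "Number of repositories not yet migrated.",
	})

	// Per-repository migration duration, for minimum/median/maximum run times.
	duration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "registry_migration_duration_seconds",
		Help:    "Time taken to migrate a single repository.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
)
```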
💥 Phase 3 — Remove the Original Registry Deployment
Once all repositories are migrated, we can finally promote the new registry, shut down the old one, and delete its GCS bucket.
FAQ
Have we considered alternatives to a side-by-side deployment?
We have considered and tested a different approach that did not require new/separate registry instances and a new bucket. This was first described in #165 (closed), implemented in #167 (closed), and later removed. In short, for write requests, the registry would write metadata to both the filesystem (storage backend, usually object storage) and the metadata database at the same time. For read requests, it would first attempt to read from the database and fall back to the filesystem if not found, backfilling the database with any missing data. Theoretically, the process would end when there was no filesystem metadata left to be imported into the database.
An approach like this revealed several problems:
- Listing requests: If a portion of the metadata is in the filesystem and another in the database, we cannot provide a consistent response to listing requests. For example, to list tags in a given repository, we would need to scan both the storage backend and the database, merging the two results. Apart from performance (the tag list UI is already the slowest one), this would make it impossible to, for example, perform any pagination.
- Composite operations: For example, when a manifest is uploaded during an image push, we first have to validate that the blobs referenced in the manifest exist and are linked to the target repository. If the metadata is not all in a single place, we have to make multiple requests to different backends to find and reconcile any related data. If any of the referenced blobs are not yet registered in the database but do exist in the filesystem, we would have to pull those from the filesystem, parse them, and register them in the database before accepting the manifest upload. All of this would need to happen synchronously and during a single HTTP request, creating a very significant delay. Furthermore, to guarantee consistency, we should perform all database inserts for composite operations within a transaction. Making requests to remote systems while holding a database transaction open is an anti-pattern.
  Apart from performance concerns, consistency is also at risk. Because the registry supports multiple storage backends and each provides a different set of consistency guarantees (e.g., read-after-write or read-after-delete), we could end up with inconsistent data if we make the database operations dependent on lookups against the storage backend. Such inconsistencies would not only be a problem but also hard to spot and debug.
- Completion: If dealing with an online migration and a separate bucket is not used, how and when do we know we are done with the migration? New data will continue to be written to the current bucket, so we end up with a moving target. Scanning and comparing both datasets would be prohibitively slow.
- Garbage collection: To work, garbage collection needs to know two main things: (1) all manifests, blobs, and tags that exist in the whole registry; and (2) which repositories are using which blobs (as they are shared). Only with this knowledge can it determine what is eligible for deletion. Because of this, if we're not using a separate bucket for the new registry, we don't know everything that is there and to which repositories it belongs, so it's not possible to use garbage collection until the very end of the migration.
  Additionally, the online garbage collector we developed is a continuous and reactive process by design. It garbage collects assets as they become dangling. This is only possible if the garbage collector can answer these two questions at any given time. Even if it were possible and we opted to leave garbage collection disabled until the very end of the migration, we could end up in a situation where we have done all that work just to find out that the production workload and usage patterns have revealed a significant issue that prevents it from working as expected. A good portion of the database schema was designed to facilitate and improve online garbage collection efficiency. Waiting until the end of the migration to verify that our expectations match reality would be high risk and counterproductive.
- Split-brain: Trying to maintain and synchronize two separate data sets (filesystem metadata and database metadata), which overlap at some point, drastically increases the chances of a split-brain situation. The risk becomes even higher when considering the different consistency guarantees and behaviors of different storage backends, which is not a problem for GitLab.com, as we use GCS, but could be for self-managed instances. Debugging issues would become increasingly complex, and any reconciliation or data repairs may require a full scan of both datasets, which would be prohibitively slow.
- Availability and integrity: The new registry includes two new major features, the metadata database and online garbage collection, where the latter depends on the former, and both required extensive code changes. While we managed to preserve the current behavior intact (the new code has been in production for several months now but dormant), enabling both of these features on the main and only GitLab.com registry cluster is a significant risk. An unforeseen bug or under-optimized code or query could impact all users.
All these limitations and concerns led us to pursue a dual approach, where we focused on enabling one-off offline migrations (applicable to self-managed instances) and a gradual migration based on a side-by-side deployment that we can safely dogfood for GitLab.com. The intent is to guarantee that we can roll out the two new major features together, enabled from the start, in a controlled and gradual way, shielding customers with existing repositories from possible availability or integrity risks that may arise from unforeseen bugs and limitations.
Do old and new registry clusters use the same application version?
Yes. The metadata database and online garbage collection features sit behind a configuration flag (disabled by default). For the current registry cluster, this configuration flag will remain disabled, while for the new registry cluster it will be enabled.
Could we avoid a per-repository read-only period for the data migration (phase 2)?
We have considered a different strategy from the one described above that would only perform a soft read-only lock of repositories. A soft lock would not prevent write requests. Instead, whenever a repository was in soft-lock mode, as soon as a write request arrived for that same repository, the registry would accept and process that request as usual. It would then signal that a write occurred during the soft-lock period, and as such, the repository needs to be imported once again.
This would be technically feasible but is considerably more challenging and complex. We can reconsider such an approach if unable to migrate a substantial amount of repositories due to the requirement for an extended hard lock/read-only mode.
📈 Advantages
- The existing registry cluster remains unchanged, except for the proxy mode. It does not use any of the new code required to manipulate the metadata database. As such, it's immune to any related source code bugs or database issues that may appear during the migration;
- Shield users of existing repositories from any unforeseen availability or consistency threats that may appear during phase 1;
- Controlled and gradual rollout with high granularity, starting only with new repositories;
- Existing repositories are migrated in a clean state, reducing the amount of data to be transferred and enabling continuous online garbage collection in the new registry;
- Ability to fall back to the old registry in case we're unable to continue serving requests through the new one;
- Minimizes the risks of data loss.
📉 Disadvantages
- Introducing and maintaining another registry cluster may require a considerable effort from Infrastructure;
- Phase 2 will likely span across several months, and its completion date is somewhat unpredictable (e.g., the next 100 repositories may be substantially larger than the previous 1000). This is not a disadvantage of this specific plan but rather something to keep in mind;
- Until 100% of repositories are migrated, we will have duplicated data across two GCS buckets, increasing our storage costs;
- Additional development effort is required to implement the described import tool, proxy mode (done), and the migration service;
- Huge repositories may require multiple migration retries. We may have to devise a plan if these start to accumulate over time;
- Cross repository blob mounts between old and new repositories won't work. Fortunately, as documented here, if a blob mount fails, clients automatically fall back to the default blob upload approach, so requests will not fail; they will just take longer to complete (and consume more bandwidth).
Relevant Links
- Push/pull request flow - This is a good overview of all the HTTP requests that might happen during `docker push` and `docker pull` commands.
- Estimated database size requirements
- Estimated database query rate requirements
- Continuous, on-demand online garbage collection - Overview of the current offline garbage collector and how we tackled online garbage collection.