Design an In Place Migration Procedure for Self-Managed Installs
🗺 Context

In order to help self-managed installs move to database metadata, we should determine a strategy for in-place migrations. We currently have a CLI tool for this, but it has mostly been used for testing and development.
We should consider expanding on what we have now to create a procedure intended for self-managed admins.
The Gradual Migration Plan developed for the GitLab.com container registry is a good jumping-off point for the development of this plan. General familiarity with the terms and techniques used in that plan will give the reader invaluable context for understanding the procedure being developed in this issue.
While we can expect differences in both the deployments and constraints for self-managed migrations, our task remains the same at its core: faithfully transfer metadata from object storage to the registry database.
🔍 Areas to Address

In contrast to the plan linked above, this procedure must accommodate a wide range of deployments, rather than being tailored to the specifics of a single deployment. This section highlights the areas in which we can expect significant differences between the GitLab.com migration and future self-managed migrations.
🚯 Resource Constraints
⏹ Downtime Component

As with the migration plan linked above, we are not able to import a repository's tags without preventing writes to that repository. In general, we can assume more flexibility in this area for self-managed deployments than for GitLab.com.
While we should reduce this as much as possible, we likely do not need a zero-downtime solution for self-managed.
This enables us to pursue a variety of options that were not feasible in the previous migration plan.
🏁 Total Migration Time

Conversely, total migration time becomes a more salient concern for self-managed. In the ideal case, this migration procedure is a discrete, one-time event that occurs within a scheduled maintenance window. To support this, we should provide estimates and/or tools for admins so they can anticipate how long their migration will take. We already have work planned in this area: Test offline migration runtime with test regist... (&8604)
🤓 Personnel and Expertise

The solution implemented for GitLab.com, while automated, was still complex enough to require a fair amount of attention and time from people with deep context. Additionally, the migration itself was an important goal for the package stage, which made that allocation possible.
In contrast, we can anticipate that the admins who perform this procedure on self-managed installs will have less context, other priorities competing for their attention, and less buy-in from their organization. Therefore, this procedure must not only be more streamlined and centralized, it must also provide a larger safety net in terms of application state.
💻 Compute

The current import logic is capable of a fair amount of parallelization. While this allows some speed to be gained, particularly when importing tags after a pre import has completed, it can also strain resources. We've seen this issue pop up for offline garbage collection with the S3 storage driver, eventually resulting in us reducing concurrent operations in S3: feat(storage/driver/s3): run DeleteFile batches... (!1159 - merged)
Given this, we should default to a somewhat conservative level of parallelism (if any), as in the sketch below.
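As a rough illustration of what a conservative default could look like, here is a minimal Go sketch of a bounded worker pool for per-repository imports. The `importRepository` function and the default limit of 2 are hypothetical placeholders, not existing registry code; only the x/sync `errgroup` package is real.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// importRepository is a hypothetical stand-in for the real per-repository
// import logic; it exists only to make the sketch self-contained.
func importRepository(ctx context.Context, repoPath string) error {
	fmt.Println("importing", repoPath)
	return nil
}

// importRepositories runs per-repository imports with a bounded worker pool
// instead of fanning out as wide as possible, so the default load on the
// database and object storage stays conservative.
func importRepositories(ctx context.Context, repos []string, maxConcurrency int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxConcurrency)

	for _, repo := range repos {
		repo := repo // capture loop variable for the goroutine
		g.Go(func() error {
			return importRepository(ctx, repo)
		})
	}
	return g.Wait()
}

func main() {
	repos := []string{"group/app", "group/other-app"}
	// A small default, overridable by an explicit flag, keeps resource use predictable.
	if err := importRepositories(context.Background(), repos, 2); err != nil {
		fmt.Println("import failed:", err)
	}
}
```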
💾 Storage Drivers

We need to find a procedure that works the same for all storage drivers. While we should not expect equal performance from each driver, we need to be sure that we're not relying on features particular to any one driver.
For example, in the GitLab.com migration we took advantage of the efficient copy operations provided by the GCS driver. While this allowed us to avoid cataloging all dangling blobs, that approach is likely to work with acceptable efficiency only for instances using GCS as their object storage provider.
For these migrations, we should instead include a step that catalogs all blobs without moving them from one prefix or bucket to another, as sketched below.
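A minimal sketch of that catalog step, assuming only a driver-agnostic walk over the blob prefix. The `BlobWalker` interface, `recordBlob` callback, and stub walker are hypothetical and simplified compared to the real storage driver API; the point is the shape of the operation — enumerate and record, never copy.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// BlobWalker is a hypothetical, driver-agnostic view of the only operation the
// catalog step needs: visiting every path under a prefix. All supported
// storage drivers can offer this without driver-specific copy features.
type BlobWalker interface {
	Walk(ctx context.Context, prefix string, fn func(path string) error) error
}

// catalogBlobs records every blob digest found under the blobs prefix so that
// dangling blobs become visible to database-backed garbage collection. Blobs
// are only enumerated, never moved between prefixes or buckets.
func catalogBlobs(ctx context.Context, w BlobWalker, recordBlob func(digest string) error) error {
	// Typical upstream layout: .../blobs/sha256/<2-char shard>/<digest>/data
	const blobPrefix = "/docker/registry/v2/blobs/sha256/"

	return w.Walk(ctx, blobPrefix, func(path string) error {
		if !strings.HasSuffix(path, "/data") {
			return nil // skip intermediate directories
		}
		segments := strings.Split(strings.TrimSuffix(path, "/data"), "/")
		return recordBlob("sha256:" + segments[len(segments)-1])
	})
}

// stubWalker lets the sketch run without a real storage driver.
type stubWalker struct{ paths []string }

func (s stubWalker) Walk(ctx context.Context, prefix string, fn func(string) error) error {
	for _, p := range s.paths {
		if strings.HasPrefix(p, prefix) {
			if err := fn(p); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	w := stubWalker{paths: []string{"/docker/registry/v2/blobs/sha256/ab/ab12cd34/data"}}
	_ = catalogBlobs(context.Background(), w, func(digest string) error {
		fmt.Println("cataloged", digest)
		return nil
	})
}
```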
⚖ Deployment Size
🔹 Small Scale — Less Than 10 TiB

Most deployments should fit into this category. While these migrations should not be a major difficulty, we should take care not to compromise the UX here too much in order to accommodate the needs of larger deployments.
Additionally, these deployments are likely able to configure the registry to use the same Postgres instance that Rails does.
Ⓜ Medium Scale — 10 TiB to 100 TiB

These deployments will behave more like either small scale or large scale deployments, depending on the specific constraints of the organization. Organizations that can tolerate more downtime may choose the simpler techniques intended for small scale deployments.
🔵 Large Scale — Larger than 100 TiB

We can anticipate that deployments of this size are owned by organizations that are sensitive to downtime, but also have more administrative resources. Given this, we can likely introduce some complexity to the procedure in order to optimize for reduced read-only time.
These deployments are also more likely to want a separate Postgres instance for their registry.
🔀 Deployment Types

We should also ensure the procedure is compatible with the various GitLab installation methods, such as the following.
🚌 Omnibus
🚢 Charts
🔧 Reconfiguration

The container registry needs to be configured to use database metadata, and the import tool also needs a working connection to the same database to perform the import. We need to ensure that the transition from pure object storage metadata → metadata import → database metadata goes smoothly.
This was less of an issue for GitLab.com, as we had been running new repositories on the metadata database for a while before starting the import of historic repositories. Additionally, the import tooling read from the same configuration as the running registry nodes.
It's likely that the import tool we use will be passed a configuration, much like the offline garbage collector.
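For illustration, the configuration hand-off could look roughly like the fragment below, with the database section present so the import tool can connect but disabled until the post-migration step. Field names and values are indicative only and may differ between registry versions.

```yaml
# Illustrative only; consult the registry configuration reference for your version.
database:
  enabled: false        # the import tool connects using these settings;
                        # flip to true once the import has completed
  host: registry-db.internal
  port: 5432
  user: registry
  password: "<secret>"
  dbname: registry
  sslmode: require
```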
☣ Safety

Migrating from object storage metadata to database metadata presents several safety concerns. Of primary importance to us is ensuring data consistency and safety. We should ensure the following:
- Object storage data from a deployment using database metadata cannot be used on a registry instance without a database configured
- An import is not attempted against a database already containing tags (with a possible override)
- Multiple simultaneous imports cannot run against a single database
- No container registry writes occur during the import phase
I believe we can manage some of these safety measures via files written to object storage. While object storage metadata like this is imperfect, as we know, it's one of the few ways we can share information across multi-registry deployments without the database.
For example, once a registry instance comes online with a database, we should write a file to object storage marking the filesystem as managed by the metadata database. This should prevent situations where the registry dataset is accidentally accessed with filesystem metadata enabled. While we can and should document this, things like stale configs restored from a backup provide pathways for the database to be disabled again by accident. Additionally, the beginning of the import step can write one of these files, ensuring that read-only mode is enabled. We can check for these files on startup and potentially via a health check/heartbeat type process.
We should design these files such that a simple deletion is enough to "reset" the registry state, as it is inevitable that they will occasionally be left behind. We should also write clear log messages indicating the location of the file, why it was written, and how and why to remove it. While the failure mode for these files is inelegant, the additional data integrity they provide is valuable enough that we should strongly consider implementing them.
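A minimal sketch of the marker file idea, assuming a driver-agnostic way to read and write small objects. The `ObjectStore` interface, marker paths, and error value are hypothetical placeholders rather than the registry's actual storage driver API.

```go
package safety

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ObjectStore is a hypothetical minimal view of the storage driver; only the
// two calls the markers need are modeled here.
type ObjectStore interface {
	PutContent(ctx context.Context, path string, content []byte) error
	GetContent(ctx context.Context, path string) ([]byte, error)
}

// ErrNotFound stands in for the storage driver's path-not-found error.
var ErrNotFound = errors.New("path not found")

// Hypothetical marker locations under the registry root.
const (
	dbManagedMarker = "/docker/registry/lockfiles/database-in-use"
	importingMarker = "/docker/registry/lockfiles/import-in-progress"
)

// MarkDatabaseInUse is written the first time a registry starts with the
// database enabled, so later startups can refuse filesystem-metadata access.
func MarkDatabaseInUse(ctx context.Context, store ObjectStore) error {
	msg := fmt.Sprintf(
		"metadata managed by the database since %s; delete this file only to intentionally reset registry state",
		time.Now().UTC().Format(time.RFC3339),
	)
	return store.PutContent(ctx, dbManagedMarker, []byte(msg))
}

// CheckFilesystemMetadataAllowed runs on startup when the database is disabled:
// if the marker exists, refuse to serve filesystem metadata and explain where
// the marker lives and when it would ever be removed.
func CheckFilesystemMetadataAllowed(ctx context.Context, store ObjectStore) error {
	_, err := store.GetContent(ctx, dbManagedMarker)
	switch {
	case err == nil:
		return fmt.Errorf("refusing to start without a database: %s exists; remove it only if you intend to abandon database metadata", dbManagedMarker)
	case errors.Is(err, ErrNotFound):
		return nil
	default:
		return err
	}
}
```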
We should review and test multiple interruptions during each phase of the import to ensure the robustness of this solution in terms of data integrity. Particularly, the import phase needs heavy validation, as almost all data integrity concerns occur within this phase.
🔨 Migration Tool

We should redesign the database import tool directly. The tool works well, but as it stands it is fundamentally a developer tool. Since only we have used it, for our own testing and development, making breaking changes allows us to present a simplified interface and ensure that the tool feels purpose-designed for the self-managed import process.
Example features:
- Managing database migrations automatically
- Default behavior matches the most common "one-shot" import (more on that below)
- Manage log levels without needing to update the registry configuration
- Include all relevant configuration options in the logs
- Simplified output to STDOUT while writing timestamped detail log files
- Audit logs detailing failed imports
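A small sketch of the simplified output idea: terse progress on STDOUT while a timestamped file keeps the full detail. The log directory, file naming, and use of the standard library logger are placeholder choices, not the tool's current behavior.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// newImportLoggers returns a terse progress logger for STDOUT and a detailed
// logger writing to a timestamped file, so admins get a readable summary while
// full detail is preserved for later inspection or audit.
func newImportLoggers(logDir string) (progress, detail *log.Logger, closeLog func() error, err error) {
	if err := os.MkdirAll(logDir, 0o755); err != nil {
		return nil, nil, nil, err
	}
	name := fmt.Sprintf("registry-import-%s.log", time.Now().UTC().Format("20060102T150405Z"))
	f, err := os.Create(filepath.Join(logDir, name))
	if err != nil {
		return nil, nil, nil, err
	}

	progress = log.New(os.Stdout, "", 0)            // human-readable, no timestamps
	detail = log.New(f, "", log.LstdFlags|log.LUTC) // timestamped detail in the file
	return progress, detail, f.Close, nil
}

func main() {
	progress, detail, closeLog, err := newImportLoggers("registry-import-logs")
	if err != nil {
		log.Fatal(err)
	}
	defer closeLog()

	progress.Println("pre import: 120/480 repositories complete")
	detail.Printf("pre imported repository %q in %s", "group/app", 1500*time.Millisecond)
}
```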
⏫ Migration Procedure

To reduce the complexity of the design and the variance between large and small instances, I'm proposing a three-phase import procedure that should accommodate the needs of both small and large deployments.
In contrast to the GitLab.com migration, the object data will remain in place rather than being copied to a new destination.
The phases are as follows:
- Pre Import
- Import
- Catalog Dangling Blobs
The pre import phase is similar to the repository pre import phase in the gradual migration plan, importing all tagged objects without importing the tags themselves. The difference for self-managed is that all repositories will be pre imported in a single pass.
Similarly, the import phase will run afterward, importing the tags and any new images pushed during the pre import phase.
The final phase is the largest departure from the gradual migration plan used on GitLab.com. This phase will go through the entirety of the blob storage backend, ensuring that any dangling blobs are entered into the database, making them visible for garbage collection.
Of these three phases, only the second needs the registry to be read-only. After it completes, a registry instance using the database can serve requests while the blob storage backend is checked for dangling blobs.
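To make the sequencing concrete, here is a minimal sketch of how the redesigned tool could chain the phases for its default run. The `Importer` interface and its methods are hypothetical placeholders; read-only mode is assumed to be handled outside the tool, as described in the procedures below.

```go
package migrate

import (
	"context"
	"fmt"
)

// Importer is a hypothetical wrapper around the three proposed phases.
type Importer interface {
	// Phase 1: import all tagged objects without their tags; the registry may keep serving writes.
	PreImport(ctx context.Context) error
	// Phase 2: import tags plus anything pushed during phase 1; the registry must be read-only or offline.
	ImportTags(ctx context.Context) error
	// Phase 3: walk blob storage and record dangling blobs; a database-backed registry may already be serving.
	CatalogDanglingBlobs(ctx context.Context) error
}

// RunAllPhases is the proposed default ("one-shot") behavior: all three phases
// in order, with errors indicating which phase failed.
func RunAllPhases(ctx context.Context, imp Importer) error {
	if err := imp.PreImport(ctx); err != nil {
		return fmt.Errorf("pre import: %w", err)
	}
	if err := imp.ImportTags(ctx); err != nil {
		return fmt.Errorf("import: %w", err)
	}
	if err := imp.CatalogDanglingBlobs(ctx); err != nil {
		return fmt.Errorf("catalog dangling blobs: %w", err)
	}
	return nil
}
```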
🎯 One-Shot
🛂 Preparatory Steps

The user will need to alter the container registry configuration to point to the metadata database, but with the database disabled. The migration tool will use this configuration and in doing so will validate that the database instance is live and configured properly.
Additionally, read-only mode will be configured at this time, or alternatively, the registry could be brought offline.
🥅 The Migration

The user runs the migration command with no options.
🥇 Post Migration Steps

The user will alter the container registry configuration to enable the metadata database. Read-only mode will be disabled at this time, and the registry can be restarted.
☑ Stepped
🛂 Preparatory Steps

The user will need to alter the container registry configuration to point to the metadata database, but with the database disabled. The migration tool will use this configuration and in doing so will validate that the database instance is live and configured properly.
🅰 Pre Import

The user runs the migration command with an option to run only the pre import phase.
🆎 Import

Read-only mode will be configured at this time, or alternatively, the registry could be brought offline.
The user runs the migration command with an option to run only the import phase.
The user will alter the container registry configuration to enable the metadata database. Read-only mode will be disabled at this time, and the registry can be restarted.
🔤 Catalog Dangling Blobs

The user runs the migration command with an option to run only the catalog dangling blobs phase. This phase, like pre import, can run while the registry serves requests.