Container Registry database migrations and deployment strategy
As part of gitlab-org&2313, we've been implementing a metadata database for the GitLab Container Registry. It was decided that this will be a separate PostgreSQL database, owned exclusively by the registry (see gitlab-org/container-registry#93 (closed) for details on that decision).
As shown by the task list in gitlab-org&2313, so far we have designed the database schema, mirrored the API read/write operations to the database, created a tool to scan and import metadata from an existing registry, and done some analysis of the expected size and rate requirements of a database for the GitLab.com registry.
As part of gitlab-org/container-registry#104, we have now started to review/discuss the database schema with the Database team (gitlab-org/container-registry!269) to identify and incorporate any changes that might be required for partitioning.
In parallel with the Database team's discussion, and as a follow-up from infrastructure#10109 (comment 345393046), we would like to discuss the deployment and schema migration strategy with the Delivery team (and any other Infrastructure team that might be interested).
The Container Registry is written in Go. The only other Go application within GitLab that relies on a PostgreSQL database is Gitaly Praefect.
The schema migrations are written in plain SQL and, like Praefect, the registry manages them internally with the `rubenv/sql-migrate` library. We have created a CLI to expose the admin functionality (docs).
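For illustration, here is a minimal sketch of how `rubenv/sql-migrate` is typically driven from Go. The connection string, migrations directory, and file naming below are placeholders rather than the registry's actual layout; the registry wraps this behind its own CLI.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
	migrate "github.com/rubenv/sql-migrate"
)

func main() {
	// Placeholder DSN; the registry reads its connection settings from the
	// database section of its configuration file.
	db, err := sql.Open("postgres", "postgres://registry@localhost/registry_metadata?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Plain SQL migration files, e.g. migrations/0001_create_manifests.sql,
	// each containing `-- +migrate Up` / `-- +migrate Down` sections.
	migrations := &migrate.FileMigrationSource{Dir: "migrations"}

	applied, err := migrate.Exec(db, "postgres", migrations, migrate.Up)
	if err != nil {
		log.Fatalf("applying migrations: %v", err)
	}
	log.Printf("applied %d migrations", applied)
}
```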
Right now, the database functionality is behind a feature flag, which can be toggled using a configuration setting (`database.enabled: true`) or the corresponding environment variable (`REGISTRY_DATABASE_ENABLED=true`). The rest of the database configuration settings can be found in the docs.
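For reference, the relevant fragment of the registry configuration file could look roughly like the following. Only `database.enabled` is taken from above; the connection keys shown are illustrative placeholders, and the docs remain the authoritative list of settings.

```yaml
database:
  enabled: true  # feature flag; REGISTRY_DATABASE_ENABLED=true overrides it
  # Illustrative connection settings (placeholders, not a real environment):
  host: registry-db.internal
  port: 5432
  user: registry
  dbname: registry_metadata
  sslmode: require
```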
As already mentioned, we have developed a tool that allows us to scan and import metadata from a registry bucket and populate a database from scratch. So far, we have tested this against a copy of the `dev.gitlab.org` registry available in GCP. The import procedure took ~6 hours. This is not feasible for a large registry like `gitlab.com`, as it doesn't support phased/resumable imports (for now) and requires downtime. However, it's enough to provide us with a realistic test database based on the `dev.gitlab.org` registry.
Currently, the registry still needs the metadata on the filesystem to operate, as it's "just" mirroring reads/writes to the database. This allows us to validate that the database is in a consistent state and that the metadata in it matches what exists on the filesystem.
We're now in a position where we would like to start testing the registry in a realistic environment with a reasonable workload, increasing our confidence in the solution, fine-tuning the implementation, and experimenting with possible approaches for a better metadata migration/import procedure and for garbage collection.
We would like to discuss the deployment and schema migration strategy, targeting a new or an existing test environment, preferably one based on the `dev.gitlab.org` registry data described above.
We could do something like the following:
Deploy the registry with the database feature flag enabled. For reads, the registry would attempt to read data from the database first and, if not found, fall back to the filesystem. For writes, it would write to both the database and the filesystem (gitlab-org/container-registry#167 (closed)). This would allow us to observe the performance of the registry and the database and validate the integrity and consistency of the metadata in both backends.
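To make the read-fallback/dual-write behaviour concrete, here is a rough Go sketch of the pattern. All identifiers below (`manifestStore`, `fsBackend`, the `manifests` table) are hypothetical and not the registry's actual code; see gitlab-org/container-registry for the real implementation.

```go
package metadata

import (
	"context"
	"database/sql"
	"errors"
)

// Manifest is a simplified stand-in for a piece of registry metadata.
type Manifest struct {
	Digest  string
	Payload []byte
}

// fsBackend is a hypothetical interface over the blob-storage-backed metadata.
type fsBackend interface {
	Get(ctx context.Context, digest string) (*Manifest, error)
	Put(ctx context.Context, m *Manifest) error
}

type manifestStore struct {
	db         *sql.DB
	filesystem fsBackend
	dbEnabled  bool // the database.enabled feature flag
}

// Get reads from the database first and falls back to the filesystem
// metadata when the manifest is not found there.
func (s *manifestStore) Get(ctx context.Context, digest string) (*Manifest, error) {
	if s.dbEnabled {
		m := &Manifest{Digest: digest}
		err := s.db.QueryRowContext(ctx,
			`SELECT payload FROM manifests WHERE digest = $1`, digest,
		).Scan(&m.Payload)
		if err == nil {
			return m, nil
		}
		if !errors.Is(err, sql.ErrNoRows) {
			return nil, err
		}
		// Not in the database yet: fall back to the filesystem metadata.
	}
	return s.filesystem.Get(ctx, digest)
}

// Put mirrors writes to both backends while the feature flag is enabled,
// so the two copies of the metadata can be cross-validated for consistency.
func (s *manifestStore) Put(ctx context.Context, m *Manifest) error {
	if err := s.filesystem.Put(ctx, m); err != nil {
		return err
	}
	if !s.dbEnabled {
		return nil
	}
	_, err := s.db.ExecContext(ctx,
		`INSERT INTO manifests (digest, payload) VALUES ($1, $2)
		 ON CONFLICT (digest) DO NOTHING`, m.Digest, m.Payload)
	return err
}
```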
We would do several deployments during this stage to improve the implementation and address any possible issues (#674 (closed) and infrastructure#9583 (closed) would be especially useful in this context). Any improvements around the deployment strategy, observability, and performance could also be addressed during this stage. If the registry is not operating normally and we need to roll back (if done on a shared/existing environment), we can disable the database feature flag, and the registry should continue to operate normally with only the blob storage backend metadata (gitlab-org/container-registry#168 (closed)).
Once we're happy with the above, we can think about the next step, which will likely involve importing all of the registry metadata from the filesystem into the database, either in one go or in a phased approach. The registry would then be started and operate exclusively with the metadata in the database, using the storage backend to write blobs only. The low-level details can be discussed and fine-tuned later on, once we have real insights from stage one. For reference, a draft of a possible plan, including these two stages, is detailed in gitlab-org/container-registry#165. We expect to revisit this later on and ask for your contribution.
Once the registry is reliably operating based on the metadata database only, we can then start testing a new and faster offline garbage collector (feasible for small and medium registries) and an online garbage collector (for large registries).