Investigate and identify gaps for Rollout of Container Registry for Self-Managed Instances

Problem statement

As discussed and reviewed in the internal doc, some potential gaps have been identified. This issue continues that investigation and aims to identify any remaining gaps against the updated definition of generally available.

The outcome of this investigation could also be a good foundation for the GitLab Infrastructure Platforms Review Process (&17136 - closed).

Acceptance Criteria

Reference

Stakeholders

Closing summary

Updated 2025-05-27

I believe we've reviewed all the gaps from the Distribution side, and we've agreed on what we would like to recommend as next steps.

Which kind of Database to support (separate logical DB vs separate instance)

For now, we decided to support only a separate logical DB as part of the GitLab-managed database, i.e., the internal database provisioned by Omnibus. Any usage of an external database has to be managed by the user.
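To make the distinction between the two options concrete, here is a minimal Python sketch. It treats a "separate logical DB" as a different database name on the same PostgreSQL instance, and a "separate instance" as a different host or port. The `DatabaseTarget` type, the helper names, and every host and database name below are hypothetical and chosen only for illustration; they are not Omnibus defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseTarget:
    """Where a component's database lives (all values are illustrative)."""
    host: str
    port: int
    dbname: str


def same_instance(a: DatabaseTarget, b: DatabaseTarget) -> bool:
    """Two targets share a PostgreSQL instance when host and port match."""
    return (a.host, a.port) == (b.host, b.port)


def is_separate_logical_db(main: DatabaseTarget, other: DatabaseTarget) -> bool:
    """Separate logical DB: same instance, different database name."""
    return same_instance(main, other) and main.dbname != other.dbname


# Hypothetical targets, not real defaults.
main_db = DatabaseTarget(host="localhost", port=5432, dbname="main_example")
registry_logical = DatabaseTarget(host="localhost", port=5432, dbname="registry_example")
registry_external = DatabaseTarget(host="registry-db.example.com", port=5432, dbname="registry_example")
```

Under these assumptions, `registry_logical` is the supported shape (same instance as `main_db`, different database), while `registry_external` is the shape that the user would have to manage themselves.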

Gaps

I've split this by code group ownership. The package team can collaborate with these groups' stakeholders to decide whether they'll self-serve and work on the missing gaps themselves, or whether the owning groups will take the work.

Durability

We need to provide the same level of backup/restore support that we have for our legacy registry users.

Issue: Backup and Restore is not considering the Conta... (#532507)

Geo

We concluded that the current design needs further refinement. Even if we supported the current design, we don't want non-Geo users to have to deploy a separate PostgreSQL server, much less a separate HA cluster. Ideally, we'd find a solution for Geo that does not require a separate instance. But this is a complex topic that requires further refinement, collaboration, and cross-functional agreement.

Issue: Evaluate Registry DB hard requirement on separa... (#535290)

Framework

Add support to GET.

Issue: https://gitlab.com/gitlab-org/gitlab-environment-toolkit/-/issues/877

Package Registry

Self Managed

GitLab Chart

Notes

We agreed that the chart won't support automatic post-deployment migrations.

Omnibus

We need to provide multiple database support, especially for the GitLab-managed database. This architectural blueprint was never fully implemented for our databases. I believe we got around it because the databases we have, other than the main database, were either never required for the basic non-Geo, single-node installation, or not supported for external usage, like the CI decomposition work.

This is unprecedented: the registry metadata database will be the first separate logical database that GitLab Omnibus provides for the simpler standard installation with a GitLab-managed database. Therefore, there's a considerable amount of work to do.

Considering that architectural design, we need to provide database support up to level 4 for the registry database, which means working on:

Operator v1

Relies on external PostgreSQL DB setup and embedded GitLab chart behaviours. Expected to be a no-op.

Operator v2

Relies on external PostgreSQL DB setup. Expected to be a no-op.

  • Has to consider implementation of automatic post-deployment migrations for the registry.
  • Has to also consider that migration jobs talk to master while the deployment talks to PgBouncer in an HA setup.
    • Issue for both above: TBD
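The second consideration above can be sketched as a small endpoint-selection rule. This is only an illustration of the split described in the list (migration jobs connect to the primary, called master above, while the deployment connects through PgBouncer when one is present). The `Endpoint` type, the `endpoint_for` function, and the workload names are hypothetical and do not reflect any existing Operator code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    """A PostgreSQL-compatible address (values are illustrative)."""
    host: str
    port: int


def endpoint_for(workload: str, primary: Endpoint, pooler: Optional[Endpoint]) -> Endpoint:
    """Pick the endpoint a registry workload should connect to.

    Migration jobs always go straight to the primary. The long-running
    deployment goes through the pooler when one is configured, and falls
    back to the primary otherwise (for example, in a non-HA setup).
    """
    if workload == "migration":
        return primary
    if workload == "deployment":
        return pooler if pooler is not None else primary
    raise ValueError(f"unknown workload: {workload!r}")


# Hypothetical addresses for an HA setup.
primary = Endpoint(host="pg-primary.example.internal", port=5432)
pgbouncer = Endpoint(host="pgbouncer.example.internal", port=6432)
```

With these assumed values, `endpoint_for("migration", primary, pgbouncer)` returns the primary and `endpoint_for("deployment", primary, pgbouncer)` returns the PgBouncer address.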

What we recommend the group::package registry to consider

This is kept here for historical purposes. The group::package registry team has considered these; see the discussion thread here: #525473 (comment 2491553949)

  • Instrumentation metrics for all installation methods to help evaluate, with a high level of confidence, how many users:
    • Are in the legacy metadata.
    • Have completed the import to the new metadata.
    • Are in the middle of an import.
    • Update: Already provided, see #525473 (comment 2493272203)
  • GitLab Admin UI which provides the user with capabilities to:
    • Start/retry an import.
    • Check the status of an import, and the status of their registry in general.
    • Know which metadata system is currently enabled.
  • Confirm it's capable of handling unexpected Backup/Restore scenarios.
    • Don't rely solely on the registry.database.enabled configuration to determine whether a user is using the registry DB. For the backup/restore scenarios, this won't be enough. The container registry is the component best placed to determine which metadata system a user belongs on. Make sure the registry can track this state after backup/restore.
  • If the registry does not detect that the new database is the one being used, add a banner in the UI telling users to migrate.
  • Confirm that we have functional and performance testing for the registry metadata DB, or create issues to implement it. Not only unit tests, which I'm sure we have, but also tests with deployed packages.
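As a rough illustration of the backup/restore and banner recommendations above, the following Python sketch lets the state reported by the registry decide the outcome and uses the configuration flag only to describe a mismatch. The `MetadataState` enum, the `banner_for` function, and the message texts are hypothetical. How the registry would actually expose its state is not defined here.

```python
from enum import Enum
from typing import Optional


class MetadataState(Enum):
    LEGACY = "legacy"        # still on the legacy metadata
    IMPORTING = "importing"  # in the middle of an import
    DATABASE = "database"    # import completed, new metadata in use


def banner_for(config_db_enabled: bool, observed: MetadataState) -> Optional[str]:
    """Return a UI banner message, or None when nothing needs attention.

    `observed` is what the registry itself reports; `config_db_enabled`
    mirrors registry.database.enabled. The observed state decides the
    outcome; the flag is only used to describe a mismatch.
    """
    if observed is MetadataState.DATABASE:
        if not config_db_enabled:
            return "Registry data is on the metadata database, but the configuration disables it."
        return None
    if observed is MetadataState.IMPORTING:
        return "A registry metadata import is in progress."
    # Legacy metadata: ask the user to migrate, and flag a mismatch if any.
    if config_db_enabled:
        return ("The configuration enables the metadata database, "
                "but the registry is still on legacy metadata. Please migrate.")
    return "The registry is still on legacy metadata. Please migrate."
```

The mismatch branches are the cases a restore from an older or newer backup could produce, which is why the flag alone is not treated as the source of truth in this sketch.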

These would be critical once we decide to enable the registry by default for the installation methods. Enabling it just for fresh installs is not straightforward, precisely because of Backup/Restore patterns. The installation methods only have knowledge of the install/upgrade configuration files, and after they finish running, we can't guarantee that those Backup/Restore patterns won't bring the database state into conflict with what was defined in those files. At this point, our best hope is that the registry component is capable of handling these inconsistencies, and that it has a very good UI to report problems back to users. Preferably, not only via logs, but via the UI, as suggested above.

Recommended rollout strategy

I believe the best rollout strategy would be:

  1. Make sure all the above issues are addressed.
    • With the above addressed, the database objects should already be created by default for new users and for existing users who upgrade, which facilitates user adoption.
    • Migration steps should also be easier, with a better user experience.
    • Users will be more confident in migrating if they have better feedback on the migration process.
  2. Keep encouraging users to migrate as the process gets easier.
  3. Watch the metrics.
    • If, at some point, the metrics point to a low number of users on the legacy method, simply enable the registry DB by default.
    • If the metrics point to a considerable number of users on the legacy method:
      • Consider giving them more time to migrate.
      • Or, consider enabling it by default, knowing that this will either:
        • Bring them to a state where the registry needs to know how to gracefully handle a forced migration, perhaps even by implementing a forced automatic migration mechanism.
        • Or mean a breaking change that needs to be considered with leadership.
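The "watch the metrics" step above could be sketched as a simple decision rule over the three user counts named earlier (legacy, importing, imported). This is a hypothetical illustration only: the `rollout_decision` function, the returned labels, and especially the 5% threshold are assumptions, since the issue does not define what counts as a "low" or "considerable" number of users.

```python
def rollout_decision(legacy: int, importing: int, imported: int,
                     low_legacy_share: float = 0.05) -> str:
    """Suggest a next step from usage counts.

    `low_legacy_share` is an assumed cut-off for "low number of users in
    the legacy method"; the real threshold would need to be agreed on.
    """
    total = legacy + importing + imported
    if total == 0:
        # No data yet: keep collecting metrics before deciding anything.
        return "keep-watching"
    legacy_share = legacy / total
    if legacy_share <= low_legacy_share:
        return "enable-by-default"
    # Considerable legacy usage: either give users more time, or enable by
    # default and accept a forced migration or a breaking change.
    return "extend-or-escalate"
```

For example, with these assumptions, 2 legacy users out of 100 would suggest enabling by default, while 40 out of 100 would suggest either extending the migration window or escalating the decision.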
Edited by João Alexandre Cunha