Investigate and identify gaps for Rollout of Container Registry for Self-Managed Instances
Problem statement
As discussed and reviewed in the internal doc, some potential gaps have been identified. This issue continues that investigation and aims to identify any gaps against the updated generally available definition.
The outcome of this investigation could also be a good foundation for GitLab Infrastructure Platforms Review Process (&17136 - closed).
Acceptance Criteria
- Pair with ~"team::GitLab Delivery Framework" and other required stakeholders to review Operational Readiness Review - Container Regist... (container-registry#1537)
- Identify self managed scope of gaps and estimate the effort
- Identify required stakeholders and engage them to identify any gaps and estimate the effort as well
- No timeline estimation is required
Reference
- https://docs.google.com/document/d/1XyELGbjN75EfGepZyQlGH2wMC_evbDDvH2MM_Luwp7k/edit?tab=t.v47bj47rd5jo#heading=h.1zsmvh9j1ljr
- https://docs.google.com/document/d/1nw4UVouGZd_HxFIKf-YZDQ1crm5jJwLZgrv5EQ1-3w0/edit?tab=t.0
Stakeholders
- groupdurability: @ibaum
- groupgeo: @mkozono
- groupframework: @nwestbury, @grantyoung
- grouppackage registry: @hswimelar
- groupSelf Managed: @Alexand
Closing summary
Updated 2025-05-27
I believe we've reviewed all the missing gaps from the Distribution side, and we've agreed on what we would like to recommend as next steps.
Which kind of Database to support (separate logical DB vs separate instance)
For now, we decided to support only a separate logical DB as part of the GitLab-managed database, i.e., the internal database provisioned by Omnibus. Any usage of an external database has to be managed by the user.
Gaps
I've split this by code group ownership. The package team can collaborate with these groups' stakeholders to decide whether they'll self-serve and collaborate by working on the missing gaps, or whether the owning groups will take the work.
Durability
We need to provide the same level of backup/restore support that we have for our legacy registry users.
Issue: Backup and Restore is not considering the Conta... (#532507)
Geo
We evaluated that the current design needs further refinement, and that even if we were to support the current design, we don't want non-Geo users to have to deploy a separate PostgreSQL server, much less a separate HA cluster. Ideally, we'd find a solution for Geo that does not require a separate instance. But this is a complex topic that requires further refinement, collaboration, and cross-functional agreement.
Issue: Evaluate Registry DB hard requirement on separa... (#535290)
Framework
Add support to GET.
Issue: https://gitlab.com/gitlab-org/gitlab-environment-toolkit/-/issues/877
Package Registry
- Possibly simplifying the import process.
- Issue: TBD
- Possibly improving visibility of the ongoing import process.
Self Managed
GitLab Chart
- Needs to provision the database, user, and extensions for our built-in PostgreSQL (to aid our test patterns, not production usage)
- Needs to update the documentation about external PostgreSQL requirements
Notes
We agreed that the chart won't support automatic post-deployment migrations.
Omnibus
We need to provide multiple database support, especially for the GitLab-managed database. This architectural blueprint was never fully implemented for our databases. I believe we were getting around it because the databases that we have, other than the main database, were either never required for the basic non-Geo, single-node installation, or were not supported for external usage, like the CI decomposition work.
It's unprecedented: the registry metadata database will be the first separate logical database that GitLab Omnibus provides for the simpler standard installation with a GitLab-managed database. Therefore, there's a considerable amount of work to do.
Considering that architectural design, we need to provide database support up to level 4 for the registry database, which means working on:
- Enable the Container Registry Metadata Database... (omnibus-gitlab#8900)
- Create a Default Database for the Container Reg... (omnibus-gitlab#8818 - closed)
- Automatic Container Registry Database Migration... (omnibus-gitlab#8670 - closed)
- Support static configuration of PgBouncer regis... (omnibus-gitlab#9082 - closed)
- (Epic) Support generating configuration for HA (Level ... (&17966 - closed)
- Similarly to charts, we need to make sure migrations talk directly to the master node and bypass PgBouncer.
- We can make use of the registry database.primary config for that.
- Needs to update the documentation about external PostgreSQL requirements
- Issue: omnibus-gitlab#9074 (closed)
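The routing rule described above (migrations go directly to the master node, regular traffic goes through PgBouncer) can be illustrated with a minimal sketch. This is not Omnibus or registry code: the `RegistryDbEndpoints` type, the `target_for` helper, the workload names, and the default ports are all hypothetical and exist only to show the decision.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegistryDbEndpoints:
    """Hypothetical endpoints for an HA setup; names are illustrative only."""
    primary_host: str           # PostgreSQL master node, reached directly
    pgbouncer_host: str         # PgBouncer in front of the database
    primary_port: int = 5432    # assumed default, adjust per deployment
    pgbouncer_port: int = 6432  # assumed default, adjust per deployment


def target_for(workload: str, endpoints: RegistryDbEndpoints) -> Tuple[str, int]:
    """Return the (host, port) a workload should connect to.

    Migrations bypass PgBouncer and talk to the master node directly;
    regular registry traffic goes through PgBouncer.
    """
    if workload == "migration":
        return (endpoints.primary_host, endpoints.primary_port)
    if workload == "runtime":
        return (endpoints.pgbouncer_host, endpoints.pgbouncer_port)
    raise ValueError(f"unknown workload: {workload!r}")
```

In this sketch, a migration job would call `target_for("migration", endpoints)` while the registry deployment would call `target_for("runtime", endpoints)`, so the two never share a connection path.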
Operator v1
Relies on external PostgreSQL DB setup and embedded GitLab chart behaviours. Expected to be a no-op.
Operator v2
Relies on external PostgreSQL DB setup. Expected to be a no-op.
- Has to consider implementing automatic post-deployment migrations for the registry.
- Has to also consider that migration jobs talk to master while the deployment talks to PgBouncer in an HA setup.
- Issue for both above: TBD
What we recommend the grouppackage registry to consider
This is kept here for historical purposes. The grouppackage registry team has considered those, see the discussion thread here: #525473 (comment 2491553949)
- Instrumentation metrics for all installation methods to help evaluate, with a high level of confidence, how many users:
- Are in the legacy metadata.
- Have completed the import to the new metadata.
- Are in the middle of an import.
- Update: Already provided, see #525473 (comment 2493272203)
- GitLab Admin UI which provides the user with capabilities to:
- Start/retry an import.
- Check the status of an import, and the status of their registry in general.
- Know which metadata system is currently enabled.
- Confirm it's capable of handling unexpected Backup/Restore scenarios.
- Don't rely solely on the registry.database.enabled configuration to determine whether a user is using the registry DB. For the backup/restore scenarios, this won't be enough. The container registry is the component best placed to identify which metadata system users belong on. Make sure the registry can track this state after backup/restore.
- If the registry does not detect that the new database is the one being used, add a banner in the UI telling users to migrate.
- Confirm that we have functional and performance testing for the registry metadata DB, or also create issues to implement it. Not only unit tests, which I'm sure we have, but also tests with deployed packages.
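The recommendation to trust the state tracked by the registry rather than the configuration flag alone can be sketched as follows. This is an illustration only: the `MetadataBackend` states, the `migration_banner` and `conflicts_with_config` helpers, and the banner wording are hypothetical, and treating an in-progress import as "no conflict" is an assumption of this sketch.

```python
from enum import Enum
from typing import Optional


class MetadataBackend(Enum):
    """Hypothetical states the registry itself would report."""
    LEGACY = "legacy"
    IMPORT_IN_PROGRESS = "import_in_progress"
    DATABASE = "database"


def migration_banner(observed: MetadataBackend) -> Optional[str]:
    """Return banner text for the Admin UI, or None when no banner is needed.

    The decision uses only the state observed by the registry,
    never the configuration flag.
    """
    if observed is MetadataBackend.DATABASE:
        return None
    if observed is MetadataBackend.IMPORT_IN_PROGRESS:
        return "A registry metadata import is in progress."
    return "The registry still uses legacy metadata. Please migrate."


def conflicts_with_config(config_enabled: bool, observed: MetadataBackend) -> bool:
    """Flag a mismatch between the config flag and the observed state.

    Assumption: an in-progress import is not treated as a conflict.
    """
    if observed is MetadataBackend.IMPORT_IN_PROGRESS:
        return False
    return config_enabled != (observed is MetadataBackend.DATABASE)
```

A restore that brings back legacy metadata while the configuration still enables the database would show up here as `conflicts_with_config(True, MetadataBackend.LEGACY)` returning `True`, alongside a migration banner.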
These would be critical once we decide to enable the registry by default for the installation methods. Enabling it just for fresh installs is not straightforward, precisely because of Backup/Restore patterns. The installation methods only have knowledge of the install/upgrade configuration files, and after they finish running, we can't guarantee that those Backup/Restore patterns won't bring the database state into conflict with what was defined in the installation/upgrade configuration files. At this point, our best hope is that the registry component is capable of handling these inconsistencies, and that it has a very good way to report problems back to users. Preferably, not only via logs, but via the UI, as suggested above.
Recommended rollout strategy
I believe the best rollout strategy would be:
- Make sure all the above issues are addressed.
- With the above addressed, the database objects should already be created by default for new users and for existing users who upgrade, which facilitates user adoption.
- Migration steps should also be easier and offer a better user experience.
- Users will be more confident in migrating if they have better feedback on the migration process.
- Keep encouraging users to migrate as the process gets easier.
- Watch the metrics.
- If, at some point, the metrics point to a low number of users on the legacy method, simply enable the registry DB by default.
- If the metrics point to a considerable number of users on the legacy method:
- Consider giving them more time to migrate.
- Or, consider enabling it by default, knowing that this will either:
- Bring them to a state where the registry needs to know how to gracefully handle a forced migration. Perhaps even implement a forced automatic migration mechanism.
- Mean a breaking change that needs to be considered with leadership.
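The "watch the metrics" step above could be reduced to a simple decision on the share of users still on the legacy method. The sketch below is hypothetical throughout: the `ImportMetrics` fields mirror the three user groups listed earlier, and the 5% threshold is a placeholder, since this issue does not define what counts as a "low" or "considerable" number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportMetrics:
    """Hypothetical instance counts per metadata state."""
    legacy: int       # still on legacy metadata
    imported: int     # completed the import to the new metadata
    in_progress: int  # in the middle of an import


def legacy_share(metrics: ImportMetrics) -> float:
    """Fraction of all reporting instances still on legacy metadata."""
    total = metrics.legacy + metrics.imported + metrics.in_progress
    if total == 0:
        raise ValueError("no instances reported")
    return metrics.legacy / total


def rollout_recommendation(metrics: ImportMetrics, low_threshold: float = 0.05) -> str:
    """Map the legacy share to one of the two branches of the strategy.

    low_threshold is a placeholder value, not an agreed figure.
    """
    if legacy_share(metrics) <= low_threshold:
        return "enable-by-default"
    return "extend-migration-window-or-escalate"
```

The second return value deliberately folds together "give users more time" and "enable anyway and handle a forced migration or breaking change", because choosing between those is a leadership decision rather than something a metric can settle.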