Investigation: Container Registry Long Term Maintainability
Context
The container registry is a large, complex project supporting not only numerous modes of operation and two APIs, but also several services and tools.
The tokei
tool reports a total of 109,422 lines of Go for this project, including test files.
Not only is this a large project in terms of code volume, this project is forked from up the upstream distribution project (formerly Docker distribution) — one of the oldest large Go projects. Therefore, we have quite a lot of code that was written for previous versions of Go, which lacked some of the modern features we enjoy today and before the current Go coding conventions were established. Not only this, the code itself is often quite complex, and frequently deals with the subtleties of the docker ecosystem, of which there are many.
Given this, and the relatively small number of engineers working full-time on the container registry, we are at risk of impacts to development velocity, speed of onboarding new engineers, as well as bus factors. We've already begun to encounter issues requiring reduplication of effort or deep refactorings:
- Fixing the handling of large configuration payloads in the importer and the API
- The notification system is deeply coupled to object storage metadata interfaces that do not have counterparts to the database metadata handing
Strategies for Reducing Maintenance Burden
GitLab.com
Usage Patterns
Unifying Self-Managed and Now that the Container Registry Migration is complete, the GitLab.com
deployment of the container registry no longer uses object storage metadata at all.
Self-Managed continues to use object storage metadata, which can perform differently across storage drivers (and across different storage backends which target the same driver).
Rolling out the metadata database to self-managed would go a long way to narrow the expertise required to support the registry.
Not only would this make self-managed install more similar to the registry on GitLab.com
, it would also reduce the difference between different self-managed installs as all their metadata operations would be handled by the same code and all use Postgres as a backend.
While a 100% adoption of the metadata database on self-managed is likely years away, we'll still gain incremental benefits as more and more users move to the new system.
Refactoring Opportunities
Distribution V2 API
The registry currently supports 5 different modes of operation for this API:
- Database Metadata Only
- Object Storage Metadata Only
- Mirroring (Writing metadata to both the database and object storage, reading from the database)
- Migration Mode
- Pull through Cache
Large portions of the logic that differentiates these modes are directly exposed in the handlers, we've already done something of an emergency refactor in the manifest handler as the complexity there had become overwhelming, even for engineers already familiar with the code.
Ideally, the handler code would not have knowledge of how information is being stored or retrieved, but presently logic in individual handlers and tests for handlers interacts with these functions on a relatively deep level. The refactor above, while it kept code in the handler, took a step in the direction of creating interfaces which remove the logic for database metadata, object storage metadata, and mirroring from the handler function directly. We could approach all handler functions in this way, using separate implementations for each mode. We have good test coverage in this area, which would support this refactor with a high level of confidence.
Related to this is removing code (discussed below). Good refactoring makes it easier to remove code; however, removing code both informs the specifics of refactoring while at the same time reducing the value that refactoring provides. We'll need to strike a balance between removal and refactoring, but in general, code that can be removed sooner and is more likely to be removed in general should take priority over refactoring efforts.
Even if we do aggressive removals for these modes of operation, it's worth also considering that sections of the code that are likely going to remain in place for several years will also benefit from a refactor here. In particular, the importer must use both database and object storage metadata and is likely to be one of the last of the listed candidates for removal to be removed, if it is removed at all.
Code Areas that are Candidates for Removal
The code in these following sections represents functionality that we can potentially drop to reduce the code maintenance burden. Each section describes the functionality briefly, its importance, and the relative gain from removing the code. Sections are listed in rough order of how easily it is to remove this code, in terms of both what functionality we would lose.
An estimate of the number of non-test lines removed is given in each section, along with a branch where the changes can be seen. These branches focus mostly on non-test code and are untested — they are intended as a demonstration only. Some of these features are intertwined, especially later in the list, therefore the estimate is based on a version of the registry will all previous removals already done.
Inventory Tool
The Inventory Tool was essentially a one-off, allowing us to do a shallow analysis on the GitLab.com
registry in preparation for the migration.
This code is pretty self-contained, but it's no longer adding value remaining on the project.
We'd be able to remove ~280 lines of code and one external dependency by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/inventory-tool
Context Logger
The context logger is already currently being removed and replaced with the new implementation, but it's worth mentioning for completeness.
Pull-Through Cache
The pull-through cache allows users to run a local registry as a pull-through cache to an upstream registry. While this functionality is extremely useful, we have not made significant changes here, and our own blog uses the upstream version of the container registry. We don't have anything special to bring to the table for this functionality, and we can work with upstream to introduce features and fix bugs that affect us. Removing this allows us to also remove the client code, which similarly exists on upstream. Fundamentally, we can lose the maintenance burden of this code without sacrificing functionality.
We'd be able to remove ~7400 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/proxy
Filesystem Mirroring
This feature allows for simultaneous writes to object storage metadata while the metadata database is in operation. Originally, this allowed us a path to capture metadata in a known format before we had built confidence in the new (at the time) database metadata subsystem. This feature does not contribute greatly to the complexity of the code; however, there are considerable drawbacks to continuing to support this feature:
- it must be turned off under the
migration
stanza and is enabled by default - the potential errors that can occur when writing metadata increased dramatically
- introduces a performance impact on writes
- undermines the database as the SSOT for registry metadata
We'd be able to remove ~23 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/mirroring
Migration Mode
This feature allows the registry to split the API requests between old and new code paths, allowing for a gradual migration, repository by repository.
While this feature was instrumental in the GitLab.com
migration, it required numerous fixes and a certain amount of direct intervention along the way by people with expert knowledge of the registry.
It's unlikely that even the largest self-managed installs will reach the size that would make a one-shot import impossible, and we now have increased confidence in both the metadata database and the online garbage collector, so the effort of running migration mode is less compelling.
Before committing to removing this, we should ensure that the read-only portion of a self-managed migration is within a reasonable timeframe for self-managed installs in the several hundreds of terrabytes range. We may also wish to include this in a refactoring effort and then remove it, so that we have a better chance to cleanly revert the removal if we encounter a self-managed install that cannot have downtime.
We'd be able to remove ~1200 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/migration. Migration mode results in relatively complex behavior, such as the code path routing, so even though the number of lines removed is relatively modest, they represent a fairly large reduction in complexity.
Like the other estimates, test code is not included in this figure since test code tends to have a lower maintaince burderen than application code. Despite this, it's worth considering that we would be able to remove some of the most complex API integration tests.
Offline Garbage Collector
From here on we start to see features which will require more work, buy in from users (migrating to the database), or dropping features.
Once the self-managed rollout is mostly complete, we should consider removing this feature and recommend moving to database metadata with online garbage collection as the supported way to remove images. This feature is an area where the differences between storage drivers has made a large negative impact, particularly S3:
- Serialization Error while performing Garbage Co... (#753 - closed)
- Offline Garbage Collections exhausts all memory... (#657 - closed)
We should be able to develop a self-managed import process that ignores untagged images, and later passes over the common blob storage to allow the online garbage collector to process any dangling blobs, removing the need to perform a offline garbage collection run to reduce the read-only time of a one-shot import.
We've made some worthwhile improvements here that we should merge upstream before considering removal:
- Offline Garbage Collections fail on filesystems... (#742 - closed)
- registry-garbage-collect -m "failed to unmarsha... (#404 - closed)
- registry-garbage-collect --delete-manifests unl... (#149 - closed)
- Added distribution.ErrRepositoryUnknown error handling inside garbagecollect.MarkAndSweep
We'd be able to remove ~700 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/offline-gc
Object Storage Metadata Upload Purger
We'd need to wait until the self-managed migration is essentially complete before attempting removing this and the following features, but these removals move us towards being able to drop much of the core object storage metadata functionality.
The upload purger cleans up partial blob chunks and metadata let behind as part of multipart uploads. These data are scoped both to the repository and to the UUID of the upload, rather than in common blob storage. In the happy path, these are cleaned up as part of the upload, but it's possible for stale data to be left behind. The upload purger essentially audits the outstanding upload data and metadata, ensuring that stale data is removed.
Before removing this we'll need a metadata database based replacement, but this code is both complex and has a number of issues:
- Upload Purging Algorithm Could Result in Unacce... (#216)
- Upload Purger Ignores User-Configured Age Param... (#428)
- Upload Purging Scales Poorly as Number of Regis... (#217)
We'd be able to remove ~270 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/purger
Blob Descriptor Cache
There is an open issue to investigate how useful this functionality is with the database enabled. Presently, we do not allow this feature to be used with the database at all and have not tested these features operating in tandem. Furthermore, this feature would not be compatible with migration mode without a major rework specifically to support migration mode. We should the linked issue before determining what approach to take here.
We'd be able to remove ~1100 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/cache
V2 API with Object Storage Metadata
Removing this code further consolidates the registry functionality and reduces the range of possible registry configurations that we officially support.
The object storage and database metadata code paths are largely separate, so this is most likely the most impactful change in terms of aligning the GitLab.com
registry and self-managed deployments.
We'd be able to remove ~550 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/object-metadata
Importer
While this feature is tremendously powerful in terms of efficiently moving registries using filesystem metadata to the metadata database, retaining it means that we need to keep large portions of the core object storage metadata handling code.
In addition, we would be able to drop the Walk
and WalkParallel
storage driver functions across all storage driver implementations.
These functions tend to be quite complex, especially WalkParallel
, so this is a huge reduction in maintenance burden.
The additionally the List
function is only used in a single place at this point, so we may be able to remove that as well, although there is not an obvious solution and the List
code tends to be less complex.
In terms of maintainability, this feature relies on a fairly particular object storage layout that is only conventionally used across implementations of the container registry. Therefore, it's likely we'll need to provide an officially support path for migration that can work across any two registries that support the V2 API spec.
We'd be able to remove ~2300 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/importer
Blob Uploads Tracking Metadata
This keeps track of multipart upload chunks.
From a maintenance perspective this is the last feature that uses the List
method of the storage drivers, and moving it to the database would allow us to remove all implementations of this method.
We'd be able to remove ~300 lines of code by removing this feature: https://gitlab.com/gitlab-org/container-registry/-/tree/818/list