Proposal for continuous, on-demand online garbage collection
Context
Currently, the container registry relies on an offline mark and sweep garbage collection (GC) algorithm. To run it, the registry needs to be either shut down or switched to read-only mode, remaining that way for the entire duration of the GC run.
In the registry storage backend, image configurations, layers and manifests are stored as blobs, deduplicated, and shared across repositories. These are then linked (like a symlink) to each repository that relies on them, giving such repositories permission to pull those blobs from the central storage location.
During the mark phase, the registry analyzes all repositories, creating a list of configurations, layers and manifests that are referenced/linked in each one of them. The registry will then list all existing configurations, layers and manifests (stored centrally) and obtain a list of those that are not referenced/linked in any repository. This is the list of blobs eligible for deletion.
With the output from the mark phase in hand, the registry starts the sweep stage, where it'll loop over all blobs identified as eligible for deletion and delete them from the storage backend, one by one.
Doing this for a huge registry may require many hours/days to complete, time during which the registry must remain in read-only mode. This is not feasible for platforms with tight availability requirements, such as GitLab.com.
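The mark and sweep steps described above can be sketched as follows. This is a minimal in-memory illustration, not the real implementation (which walks the storage backend); all names and data here are hypothetical:

```python
def mark(all_blobs, repositories):
    """Mark phase: collect every blob digest linked in any repository,
    then return the set of blobs that no repository links to."""
    referenced = set()
    for linked_blobs in repositories.values():
        referenced.update(linked_blobs)
    # Blobs stored centrally but linked by no repository are eligible for deletion.
    return all_blobs - referenced

def sweep(storage, eligible):
    """Sweep phase: delete each eligible blob from the backend, one by one."""
    for digest in eligible:
        storage.pop(digest, None)

# Hypothetical data: blob "sha256:c" is linked by no repository.
storage = {"sha256:a": b"cfg", "sha256:b": b"layer", "sha256:c": b"orphan"}
repositories = {"group/app": {"sha256:a"}, "group/lib": {"sha256:a", "sha256:b"}}

eligible = mark(set(storage), repositories)
sweep(storage, eligible)
```

Note that `mark` must see a consistent snapshot of all repositories at once, which is exactly why the registry has to be read-only while it runs.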
To overcome this, we started an effort to migrate the registry metadata (the list of blobs, repositories, and which manifests/layers are referenced/linked in each one of them) from the storage backend into an SQL database (&2313 (closed)). By leveraging an SQL database, we can dramatically reduce the time needed to enumerate all existing artifacts and detect those eligible for deletion (among other benefits, such as the ability to extract metrics and major performance improvements).
The goal for #224 (closed) is to leverage the metadata database to enable online GC, without needing to switch the registry to read-only mode.
This issue details a specification for an online GC algorithm that adopts a different approach to GC, departing from the classic mark and sweep.
Assumptions
The described online GC approach is only applicable to new repositories or existing repositories that can be moved to an instance backed by the metadata database, using a gradual migration strategy.
Rationale
Mark and sweep drawbacks
With the mark and sweep approach, the registry operates normally for a period of time, during which it gradually becomes dirty. At some point, GC kicks in (either manually or automated) to clean any artifacts eligible for deletion. To do so, the registry has to find out what happened in all repositories since the last GC run. Doing this exercise has some clear implications:
- The bigger the registry gets, the higher the time and space requirements to determine what happened since the last run and what is now eligible for deletion (mark phase);
- As the mark phase run time increases, so do the chances of false positives and race conditions if we need to keep the registry in read/write mode.
As an example, let's consider that it takes 5 minutes (fairly optimistic) to run the mark phase on the metadata database for a huge (petabytes in size) registry. Let's also consider that the outcome of it identifies layer `L` as eligible for deletion. If someone uploads an image `I` that also relies on layer `L` during that 5-minute window, the outcome of the mark phase is no longer valid. We cannot delete `L`, otherwise image `I` will become corrupted. To mitigate this, we would have to do one of the following:
- Lock the database (entire tables, as we need to consider all artifacts and not just one or a few) during the mark phase, which is by itself a threat to availability. The bigger the registry gets, the longer it takes;
- Rerun the mark phase if there were changes during the previous run. This can become an endless loop;
- Develop and operate a mechanism to record and reconcile the outcome of the mark phase with the changes that might have occurred while it was running. This becomes highly complex to operate and debug.
- We have to repeat the whole process if any failure occurs during one of the stages;
- The time and resources required to run a mark and sweep algorithm targeting all repositories and artifacts might not only become inconvenient in terms of performance but also a waste in terms of efficiency. We have no idea whether the registry has any garbage to be collected until the mark phase completes. Sometimes the time and resources spent turn out not to be worth it (nothing, or almost nothing, to delete); other times GC finds so many artifacts to delete that it causes a prolonged spike in the database and storage backend workload.
Instead of struggling to clean, strive to keep it clean
This is the whole idea behind the proposed approach. Instead of letting the registry become dirty and attempting to clean it periodically from end to end, we could simply not let it become dirty.
We could keep it clean continuously and on-demand, acting only when we know something might have become eligible for deletion. To achieve this we need to act upon all API requests that can lead to garbage being left in the storage backend.
For example, if someone deletes a manifest from a repository through the API, we should go and find out if there are any other repositories that need that specific manifest (not all existing manifests), otherwise we should wipe it from the storage backend and keep the registry clean.
The cleanup tasks should be processed asynchronously, avoiding any impact on API response times and ensuring that no tasks are lost, that failed tasks can be retried (so we never fail to clean, at least not without noticing) and that we can instrument, monitor and report on their activity.
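The decoupling between the API request and the cleanup work could look like the following sketch. All names here are hypothetical, and a simple in-memory queue stands in for the job scheduler; the point is only that the request path records *what* to review, never performs the deletion itself:

```python
from collections import deque

gc_queue = deque()  # stands in for the job scheduler's queue

def delete_manifest(links, repository, digest):
    """API handler sketch: unlink the manifest from the repository and
    schedule a GC task for it. No blob deletion happens on the request
    path; eligibility is only determined later, when the task runs."""
    links[repository].discard(digest)
    gc_queue.append({"artifact": digest})  # asynchronous cleanup task
    return 202  # respond immediately; cleanup runs out of band

# Hypothetical data: manifest "sha256:m1" is shared by two repositories.
links = {"group/app": {"sha256:m1"}, "group/lib": {"sha256:m1"}}
status = delete_manifest(links, "group/app", "sha256:m1")
```

When the task is later processed, the manifest will be found to still be linked in `group/lib` and nothing will be deleted, which is exactly the desired behavior for shared blobs.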
Benefits
- GC only happens when we know that something might have become eligible for deletion;
- The registry is kept clean, with lower fluctuations in storage space, thus optimizing costs;
- Instead of analyzing all artifacts, we can focus on a single artifact at a time, dividing to conquer and ensuring that GC scales efficiently as the registry size increases;
- As we continuously spawn GC tasks for individual artifacts, the load becomes diluted over time, posing fewer threats to availability and performance. As an example, if someone deletes manifest `A` from repository `R`, we just need to find out whether any other repository is using manifest `A`; we don't need to be concerned with all other manifests as well;
- By narrowing down the target artifacts to one, we can drastically reduce the time and space requirements to determine eligibility for deletion. This also reduces the exposure to false positives and race conditions to the bare minimum. If it only takes a few milliseconds to determine whether a specific artifact is eligible for deletion, it's extremely unlikely that a conflicting write request targeting that same artifact would arrive and be persisted during that period of time;
- In case an individual cleanup task fails, it won't affect the remaining ones; we don't have to cancel or rerun them. Tasks are kept isolated by targeting different individual artifacts.
Drawbacks and Mitigation Strategies
- Although very unlikely (at least for the GitLab.com registry, as it has a fairly low write rate), it's not impossible to have a race condition between identifying an artifact as eligible for deletion and deleting its blob from the storage backend. Fortunately, by narrowing down to a single artifact, we can lock specific records on the database (instead of entire tables) to avoid such race conditions;
- A GC task is needed for every request that might generate garbage, so there will be a fair number of tasks that do not result in any deletion. There is not much we can do about this; it's the main trade-off. The worker pool must scale accordingly;
- As already mentioned, this algorithm is only applicable to new repositories or those that can be migrated to a fresh instance backed by the metadata database. We intend to provide tools and migration strategies for registries of all sizes to make this possible;
- Although unlikely, it's not impossible for some cleanup tasks to fail, which, if left unattended for a lengthy amount of time, may lead to a considerable amount of garbage left behind. This can be attenuated by ensuring that we have a solid job scheduler that never loses track of GC tasks, even in case of repeated failures. If an application bug is blocking a set of GC tasks, it should be possible to stage these for a later batch retry once the bug is fixed. In the end, as the process improves, the number of missing or failed and unrecoverable GC tasks should tend to zero;
- Making the cleanup tasks asynchronous will require us to build a scheduling process internally, relying on the database. This is a fair price for the added benefits, though.
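The first drawback above, the race between the eligibility check and the blob deletion, is avoided with per-artifact locking. The sketch below illustrates the idea with an in-memory lock per digest; in the actual database this would be a row-level lock on the artifact's record (e.g. `SELECT ... FOR UPDATE` in PostgreSQL) rather than a Python lock, and all names are hypothetical:

```python
import threading

# One lock per artifact digest stands in for a row-level database lock.
locks = {}
registry_links = {"sha256:x": set()}  # digest -> repositories linking it
storage = {"sha256:x": b"blob"}

def delete_if_dangling(digest):
    """GC worker: delete the blob only while holding the artifact's lock,
    so no writer can re-link it between the check and the deletion."""
    lock = locks.setdefault(digest, threading.Lock())
    with lock:
        if not registry_links.get(digest):
            storage.pop(digest, None)
            registry_links.pop(digest, None)
            return True
        return False

def link(repository, digest):
    """Writer: linking takes the same per-artifact lock, so it either runs
    before the GC check (blocking deletion) or after it (and must re-upload)."""
    lock = locks.setdefault(digest, threading.Lock())
    with lock:
        if digest not in storage:
            return False  # blob already gone; writer must re-upload it
        registry_links[digest].add(repository)
        return True

deleted = delete_if_dangling("sha256:x")
relinked = link("group/app", "sha256:x")
```

Because both paths serialize on the same single-artifact lock, the race can only resolve in one of two safe orders; it can never delete a blob that a concurrent writer has just re-linked.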
Relevant API Requests
Only a few of the operations exposed by the container registry API may lead to dangling artifacts that can be garbage collected. Furthermore, we only need to act upon requests that complete successfully.
The table below lists the relevant API requests, the prerequisites that must be met to create a GC task for that request (besides finishing successfully), the artifacts that might become eligible for deletion, and the conditions that must be met to determine them as eligible for deletion.
| Name | API route | Prerequisites | What may become eligible for deletion | Conditions to be eligible for deletion |
|---|---|---|---|---|
| Deleting a tag | `DELETE /v2/<name>/tags/reference/<reference>` | N.A. | Manifest: the blob of the manifest that the tag points to. Layers: the blobs of the layers referenced by the manifest that the tag points to. Configuration: the blob of the configuration referenced by the manifest that the tag points to. | Manifest: if no other tag in any repository points to that same shared manifest. Layers: if no tagged manifest in any repository also references those same shared layers. Configuration: if no tagged manifest in any repository also references the same configuration. |
| Deleting a manifest | `DELETE /v2/<name>/manifests/<reference>` | N.A. | Manifest: the manifest blob. Layers: the blobs of the layers referenced by the manifest. Configuration: the blob of the configuration referenced by the manifest. | Manifest: if no tag in any repository points to that same shared manifest. Layers: if no tagged manifest in any repository also references those same shared layers. Configuration: if no tagged manifest in any repository also references the same configuration. |
| Uploading a manifest | `PUT /v2/<name>/manifests/<reference>` | 1. `reference` is a tag, not a digest; 2. a tag named `reference` already exists in repository `name`; 3. the tag currently points to a manifest that is not the one being uploaded (it was pointing to manifest `A` and will now point to manifest `B`). | Manifest: the blob of manifest `A`. Layers: the blobs of the layers referenced by manifest `A`. Configuration: the blob of the configuration referenced by manifest `A`. | Manifest: if no tag in any repository points to the shared manifest `A`. Layers: if no other tagged manifest in any repository also references the same shared layers referenced by manifest `A`. Configuration: if no other tagged manifest in any repository also references the same configuration referenced by manifest `A`. |
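The eligibility conditions in the table are all narrow, single-artifact lookups. The sketch below expresses two of them (manifest and layer eligibility) over hypothetical in-memory metadata; in the real registry these would be queries against the metadata database:

```python
def manifest_eligible(tags, digest):
    """A manifest blob is eligible for deletion if no tag in any
    repository points to it."""
    return all(target != digest for target in tags.values())

def layer_eligible(tags, manifest_layers, layer_digest):
    """A layer blob is eligible for deletion if no tagged manifest in any
    repository references it."""
    tagged = set(tags.values())
    return all(layer_digest not in manifest_layers[m] for m in tagged)

# Hypothetical metadata: (repository, tag) -> manifest digest, and
# manifest digest -> set of layer digests it references.
tags = {("group/app", "latest"): "sha256:m1"}
manifest_layers = {
    "sha256:m1": {"sha256:l1"},
    "sha256:m2": {"sha256:l1", "sha256:l2"},
}
```

Here `sha256:m2` is untagged, so its manifest blob is eligible; layer `sha256:l2` is only referenced by the untagged manifest and is therefore eligible as well, while `sha256:l1` is protected by the tagged `sha256:m1`.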
Execution Flow
The diagram below details the execution flow for GC, in the presence of one of the requests listed above.
```mermaid
sequenceDiagram
    participant C as Client
    participant R as Registry
    participant D as Database
    participant B as Blob Storage Backend
    participant J as Job Scheduler
    C->>R: HTTP request that<br/>might generate garbage.<br/>Targets artifact X.
    activate R
    Note right of R: Fulfill HTTP request
    R->>D: Query and persist changes
    opt If uploading a manifest
        R-->>B: Upload
    end
    opt Request successfully processed
        R-xJ: Schedule GC task for artifact X
    end
    R->>C: HTTP response
    deactivate R
    J->>R: GC task for artifact X
    activate R
    Note right of R: Process GC task
    R->>D: Artifact X eligible for deletion?
    D->>R: Yes/No
    alt Yes
        R->>D: Lock artifact X
        R->>B: Delete blob of artifact X
        alt Success
            R->>J: ACK
            J->>J: Task complete
        else Failure
            R->>J: NACK
            alt Within retry limit
                J->>J: Requeue for automatic retry
            else Retry limit exhausted
                J->>J: Set aside for manual retry
            end
        end
    else No
        R->>J: ACK
        J->>J: Task complete
    end
    R->>D: Unlock artifact X
    deactivate R
```
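The task-processing half of the diagram (eligibility check, deletion, ACK/NACK and the retry branches) can be sketched as follows. This is an illustrative model with hypothetical names and a fixed retry limit of 3; the real worker would also take and release the database lock around the deletion:

```python
MAX_RETRIES = 3  # assumption: the real limit would be configurable

def process_task(task, is_eligible, delete_blob, queue, dead_letter):
    """Process one GC task as in the diagram: check eligibility, delete the
    blob if eligible, ACK on success, NACK on failure. Failed tasks are
    requeued until the retry limit is exhausted, then set aside for
    manual retry."""
    digest = task["artifact"]
    if not is_eligible(digest):
        return "ack"  # nothing to clean; task complete
    try:
        delete_blob(digest)
        return "ack"
    except OSError:
        task["retries"] = task.get("retries", 0) + 1
        if task["retries"] < MAX_RETRIES:
            queue.append(task)  # requeue for automatic retry
        else:
            dead_letter.append(task)  # set aside for manual retry
        return "nack"

# Example: a storage backend that always fails exhausts the retry limit.
queue, dead_letter = [{"artifact": "sha256:x"}], []

def always_fail(digest):
    raise OSError("backend unavailable")

while queue:
    process_task(queue.pop(0), lambda d: True, always_fail, queue, dead_letter)
```

The dead-letter list is what makes failures observable rather than silent: tasks that exhaust their retries are never dropped, only parked for batch retry once the underlying problem is fixed.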