Implement Docker Distribution in-line Garbage Collection (removal of Blobs)

TL;DR: this issue proposes that we drop Pruner, build one of the proposals below into Docker Distribution, and call it via an API from Rails, to make garbage collection resilient, fast, and easy to use.

1. Problem to solve

Currently, Docker Distribution implements a Garbage Collection mechanism that requires the registry to be put in read-only mode. This solution does not work at all at large scale, where a registry consists of TBs of data.

This issue proposes ways to implement garbage collection so that it works at large scale.

2. Preface

This preface documents how the Registry stores data and how its elements are interconnected.

2.1. Registry structure

The same structure is used for every storage backend supported by the registry.

./docker/registry/v2/blobs/sha256/21/21a899f04608c13a49bfdac34bbf14e7875e2bf6e0b625525761c30ad3ebf79f/data
./docker/registry/v2/blobs/sha256/f1/f12ba4f684b156aad1826feb1f8da69f42ab6e02338c9c19632664a3bf37a70a/data
./docker/registry/v2/repositories/image/_manifests/tags/latest/index/sha256/c3490dcf10ffb6530c1303522a1405dfaf7daecd8f38d3e6a1ba19ea1f8a1751/link
./docker/registry/v2/repositories/image/_manifests/tags/latest/index/sha256/f12ba4f684b156aad1826feb1f8da69f42ab6e02338c9c19632664a3bf37a70a/link
./docker/registry/v2/repositories/image/_manifests/tags/latest/current/link
./docker/registry/v2/repositories/image/_manifests/revisions/sha256/c3490dcf10ffb6530c1303522a1405dfaf7daecd8f38d3e6a1ba19ea1f8a1751/link
./docker/registry/v2/repositories/image/_manifests/revisions/sha256/21a899f04608c13a49bfdac34bbf14e7875e2bf6e0b625525761c30ad3ebf79f/link
./docker/registry/v2/repositories/image/_manifests/revisions/sha256/f12ba4f684b156aad1826feb1f8da69f42ab6e02338c9c19632664a3bf37a70a/link
./docker/registry/v2/repositories/image/_layers/sha256/064c7eacdb01e4925c8a77b74fb95e2bb78782971c9a701c686dcc50979208bf/link
./docker/registry/v2/repositories/image/_layers/sha256/f645750d5c456ace26260b4f1d9596dcdab28324306a41ecb8d8a54988f1018b/link
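The layout above can be summarised as a few path-building helpers. This is only a minimal sketch: the function names are hypothetical and not part of Docker Distribution, and only the path shapes are taken from the listing.

```python
from typing import Tuple

ROOT = "./docker/registry/v2"


def split_digest(digest: str) -> Tuple[str, str]:
    """Split 'sha256:<hex>' into (algorithm, hex)."""
    algorithm, _, hex_part = digest.partition(":")
    if not algorithm or not hex_part:
        raise ValueError(f"not a digest: {digest!r}")
    return algorithm, hex_part


def blob_data_path(digest: str) -> str:
    """Global blob storage: sharded by the first two hex characters."""
    algorithm, hex_part = split_digest(digest)
    return f"{ROOT}/blobs/{algorithm}/{hex_part[:2]}/{hex_part}/data"


def layer_link_path(repository: str, digest: str) -> str:
    """Per-repository link that makes a layer blob visible in that repository."""
    algorithm, hex_part = split_digest(digest)
    return f"{ROOT}/repositories/{repository}/_layers/{algorithm}/{hex_part}/link"


def revision_link_path(repository: str, digest: str) -> str:
    """Per-repository link to a manifest revision."""
    algorithm, hex_part = split_digest(digest)
    return (f"{ROOT}/repositories/{repository}/_manifests/revisions/"
            f"{algorithm}/{hex_part}/link")


def tag_current_link_path(repository: str, tag: str) -> str:
    """Link holding the manifest revision a tag currently points to."""
    return f"{ROOT}/repositories/{repository}/_manifests/tags/{tag}/current/link"
```

Note that only the global blob path is sharded by a two-character prefix; the per-repository links use the full digest directly.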

2.2. Repositories

A repository consists of _layers and _manifests. The registry allows access to a manifest using content-addressable identifiers. For example, you can request an image via docker pull my.registry.com/image@sha256:c3490dcf10ffb6530c1303522a1405dfaf7daecd8f38d3e6a1ba19ea1f8a1751 as well as by tag. Underneath, the tag is resolved to a manifest revision.

Each manifest consists of fsLayers. Only layers that are stored within the given repository can be accessed. Layers and manifests are stored as blobs in the global blob storage. Since each blob is content-addressable, the system assumes that there are no hash collisions and makes no attempt to guard against them.

2.3. Content-addressable manifests

What is important is that revisions stored within a repository stay in it. This is the behavior that Docker Hub follows today. Example:

docker build -t my.registry.com/image:latest . # produces sha256:A
docker push my.registry.com/image:latest

# make change and rebuild
docker build -t my.registry.com/image:latest .  # produces sha256:B
docker push my.registry.com/image:latest

You can still access sha256:A even though it is no longer reachable through a tag. This is called content-addressable access to manifests. The tag here points to the latest version of the manifest.

This is especially useful if you use :latest and do not use semantic versioning of tags. All container scheduler systems (including Kubernetes) convert the tag to the content-addressable image reference when scheduling a deployment. They do this because an image referenced by tag name is a moving target and subject to change.
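The behavior in this example can be modelled with a small in-memory sketch. The `Repository` class and its method names are hypothetical and only illustrate the relationship between tags and revisions; this is not how Docker Distribution implements it.

```python
class Repository:
    """Toy model: a tag is a movable pointer, revisions are kept forever."""

    def __init__(self):
        self.revisions = set()  # every manifest digest ever pushed
        self.tags = {}          # tag name -> current manifest digest

    def push(self, tag, manifest_digest):
        # Pushing never removes an older revision, it only moves the tag.
        self.revisions.add(manifest_digest)
        self.tags[tag] = manifest_digest

    def resolve(self, reference):
        """Resolve either a digest ('sha256:...') or a tag name."""
        if reference.startswith("sha256:"):
            if reference not in self.revisions:
                raise KeyError(reference)
            return reference
        return self.tags[reference]


repo = Repository()
repo.push("latest", "sha256:A")
repo.push("latest", "sha256:B")
```

After the second push, `repo.resolve("latest")` yields `sha256:B`, while `repo.resolve("sha256:A")` still succeeds because the old revision remains in the repository.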

2.4. Blobs

The registry uses global blob storage, which makes it easy to share blobs between different repositories. Only blobs attached to a given repository can be accessed in that repository's scope. The system does not allow direct access to the global blob storage. The hash is generated from the blob content itself.
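As a rough illustration of these two properties (the digest is derived from the content, and access is scoped per repository), consider the sketch below. `ScopedBlobStore` is a hypothetical in-memory stand-in, not a registry API.

```python
import hashlib


def blob_digest(content: bytes) -> str:
    """The identifier of a blob is the SHA-256 of its content."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class ScopedBlobStore:
    def __init__(self):
        self._blobs = {}  # digest -> content, shared by all repositories
        self._links = {}  # repository -> set of digests linked into it

    def push(self, repository: str, content: bytes) -> str:
        digest = blob_digest(content)
        self._blobs[digest] = content  # identical content is stored once
        self._links.setdefault(repository, set()).add(digest)
        return digest

    def get(self, repository: str, digest: str) -> bytes:
        # A blob is only visible through a repository that links to it.
        if digest not in self._links.get(repository, set()):
            raise KeyError(f"{digest} is not linked in {repository}")
        return self._blobs[digest]
```

A blob pushed to one repository is stored globally, but it cannot be read through another repository until it is linked there as well.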

2.5. HTTP API

The Registry exposes a simple REST API. Each API call is scoped to a given repository. Example of docker push:

192.168.88.233 - - [18/Feb/2019:12:21:53 +0100] "HEAD /v2/image/blobs/sha256:c97bea55d25911c1b98d8229da491f4a1cab5850147783f334b9f4bdd6e24ccd HTTP/1.1" 404 157 "" "docker/18.09.1 go/go1.10.6 git-commit/4c52b90 kernel/4.20.0-042000-generic os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.1 \\(linux\\))"
192.168.88.233 - - [18/Feb/2019:12:21:54 +0100] "POST /v2/image/blobs/uploads/ HTTP/1.1" 202 0 "" "docker/18.09.1 go/go1.10.6 git-commit/4c52b90 kernel/4.20.0-042000-generic os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.1 \\(linux\\))"
192.168.88.233 - - [18/Feb/2019:12:21:54 +0100] "PATCH /v2/image/blobs/uploads/259a6c1c-f758-444f-8e2c-10e4e37aba12?_state=PNsLJr_GF6JQlORa7tRfckTTBatGlrY8WDIdaS30WVR7Ik5hbWUiOiJpbWFnZSIsIlVVSUQiOiIyNTlhNmMxYy1mNzU4LTQ0NGYtOGUyYy0xMGU0ZTM3YWJhMTIiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTktMDItMThUMTE6MjE6NTQuMDA3NTYzNDA0WiJ9 HTTP/1.1" 202 0 "" "docker/18.09.1 go/go1.10.6 git-commit/4c52b90 kernel/4.20.0-042000-generic os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.1 \\(linux\\))"
192.168.88.233 - - [18/Feb/2019:12:21:54 +0100] "POST /v2/image/blobs/uploads/?from=root%2Fartifacts-test&mount=sha256%3Af75a3de779f71b63feb01f57dca3c88f6009f8df6de4f6d868439c02615e3d4e HTTP/1.1" 202 0 "" "docker/18.09.1 go/go1.10.6 git-commit/4c52b90 kernel/4.20.0-042000-generic os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.1 \\(linux\\))"
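The request paths in this log all start with the repository name. A small sketch of how they are composed follows; the helper names are hypothetical, and the path shapes are taken from the log lines above.

```python
from urllib.parse import urlencode


def blob_path(repository: str, digest: str) -> str:
    """Path used by the HEAD request that checks whether a blob exists."""
    return f"/v2/{repository}/blobs/{digest}"


def upload_start_path(repository: str) -> str:
    """Path used by the POST request that starts a blob upload."""
    return f"/v2/{repository}/blobs/uploads/"


def mount_path(repository: str, source_repository: str, digest: str) -> str:
    """Path used by the POST request that asks for a blob from another
    repository to be mounted instead of being uploaded again."""
    query = urlencode({"from": source_repository, "mount": digest})
    return f"{upload_start_path(repository)}?{query}"
```

For example, `mount_path("image", "root/artifacts-test", "sha256:f75a3de7...")` reproduces the shape of the last request in the log, with `/` and `:` percent-encoded in the query string.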

3. Problems

The way Docker Distribution is built follows the principle of a directed graph. This works great as long as you only add data, because you do not have to keep track of all references to be able to reconstruct the graph in the other direction. This is a big plus, as it makes it easy to build a highly scalable system on top of common generic storage, without extra state.

The drawback of such a system is that removing blobs and reclaiming space is very hard. Effectively, it requires stop-the-world garbage collection (which is what Docker Distribution does today). Of course, stop-the-world can be used for rarely used instances (or instances that can be put in maintenance mode) for a short period of time. However, it does not work at all in large-scale systems that need to be operational all the time.
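To make the cost concrete, here is a minimal mark-and-sweep sketch over an in-memory model. The data shapes and function names are assumptions for illustration only; the real garbage collector works on the storage backend.

```python
def mark(repositories):
    """Walk every repository and collect all reachable digests.

    repositories: repo name -> {manifest digest: [layer digests]}
    """
    reachable = set()
    for manifests in repositories.values():
        for manifest_digest, layers in manifests.items():
            reachable.add(manifest_digest)
            reachable.update(layers)
    return reachable


def sweep(blob_store, reachable):
    """Delete every blob that was not marked; returns the removed digests."""
    removed = sorted(d for d in blob_store if d not in reachable)
    for digest in removed:
        del blob_store[digest]
    return removed


# The registry must not accept writes between mark() and sweep():
# a blob pushed after mark() would be missing from `reachable` and swept.
repositories = {"image": {"sha256:m1": ["sha256:l1", "sha256:l2"]}}
blob_store = {"sha256:m1": b"", "sha256:l1": b"", "sha256:l2": b"", "sha256:old": b""}
removed = sweep(blob_store, mark(repositories))
```

Because the forward graph is the only information available, `mark` has to visit every repository before a single blob can be deleted, which is what makes this approach impractical on a registry holding TBs of data.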

4. Current work

We have already done a fair amount of work to aid Garbage Collection of Docker Distribution. This section briefly describes what exists today.

4.1. Docker Distribution

Docker Distribution provides the registry garbage-collect tool, which can be used in maintenance mode.

It also offers registry garbage-collect -m, which removes non-latest revisions of manifests, that is, manifests that are not directly referenced by tags.
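The selection rule described here, manifests not directly referenced by any tag, amounts to a set difference. The sketch below only illustrates that rule with a hypothetical helper; it is not the tool's implementation.

```python
def untagged_revisions(revisions, tags):
    """Return the manifest revisions that no tag currently points to.

    revisions: iterable of manifest digests stored in the repository
    tags:      tag name -> manifest digest
    """
    return set(revisions) - set(tags.values())


candidates = untagged_revisions(
    ["sha256:A", "sha256:B"],
    {"latest": "sha256:B"},
)
```

In this example `sha256:A` is the only candidate, since `latest` was moved to `sha256:B`.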

4.2. Docker Distribution Pruner

This is my pruner tool, which can be considered a more efficient version of registry garbage-collect -m. More efficient here means 10000x more efficient, but it still follows the same stop-the-world philosophy and is not prepared to run at large scale.

4.3. API support

Recently we introduced API support for cleaning up old tags from GitLab Container tags. Used in conjunction with registry garbage-collect -m or docker-distribution-pruner, this allows managing storage usage for small (on-premise) installations that can put the registry in short-term maintenance mode (once a week, on the weekend, around midnight?).

5. Proposals

This section describes a set of changes that would allow us to implement efficient blob removal.

Proposal 1 (5.1) is required so that we do not have to put the registry in maintenance mode when performing a partial cleanup of the registry.

Proposals 2a (5.2a) and 2b (5.2b) are potential solutions for efficiently garbage collecting the registry.

5.1. Keep track of what is being accessed

Maintenance mode is required because the server does not know which blobs it can remove without affecting clients that are currently transferring data. The Registry does not know the last access time of a blob. We could extend the registry to store the last access time (likely in Redis), which would make the Registry aware that a blob (layer or manifest) cannot be removed yet because it was recently accessed. "Recently" could mean within the last 24h. Doing that would allow us to avoid putting the registry in maintenance mode while doing global or local garbage collection. Since this would use shared storage, the state would be shared between all registry servers and would therefore work in an HA environment.
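A minimal sketch of such access tracking is shown below. The `AccessTracker` class is hypothetical, and a plain dictionary stands in for the shared store (Redis in the proposal); the 24h window is the value suggested above.

```python
import time


class AccessTracker:
    """Records the last access time per blob and answers whether a blob
    was accessed within the grace window."""

    def __init__(self, grace_seconds=24 * 60 * 60, clock=time.time):
        self._last_access = {}  # digest -> timestamp (shared store in practice)
        self._grace = grace_seconds
        self._clock = clock

    def touch(self, digest):
        """Call on every read or write of a blob."""
        self._last_access[digest] = self._clock()

    def recently_accessed(self, digest):
        last = self._last_access.get(digest)
        return last is not None and self._clock() - last < self._grace

    def safe_to_remove(self, digest):
        """GC may only delete blobs outside the grace window."""
        return not self.recently_accessed(digest)
```

A garbage collector would consult `safe_to_remove` before deleting an unreferenced blob, and skip the blob until a later run if it was touched recently.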

5.2a. Divide and conquer

The first idea is to make the problem smaller. Since we know that scanning the global blob storage is a very big and time-consuming task, and requires putting the registry in maintenance mode (unless we implement 5.1), we could start uploading blobs local to the repository.

Instead of using /docker/registry/blobs/... we would use:

./docker/registry/v2/repositories/image/_blobs/sha256/21/21a899f04608c13a49bfdac34bbf14e7875e2bf6e0b625525761c30ad3ebf79f/data

Pros:
- All new blobs would be stored local to repository,
- We could confidently get information about the size of repository (quickly),
- We reduce the data set to traverse when executing GC,
- We would have to only slightly adapt the GC command to take as argument the repository so we could perform small GC that would be scheduled by GitLab Rails,
- The backward compatibility for that would be very simple: we would only remove blobs that are stored within repository, we would not touch blobs outside of repository (stored in global blob storage),
- We would introduce an HTTP API for garbage collecting the repository.
Cons:
- This breaks the sharing of blobs between repositories. It is likely not a big deal for us, as our registry is mostly used for deployments and not really for sharing images,
- Since some of the blobs would be duplicated, there would be a slight increase in disk space usage for common base layers.
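With blobs stored local to the repository, both the size calculation and the garbage collection only need to look at one repository. The sketch below is an in-memory illustration under two assumptions: the function names are hypothetical, and the retention policy (keep only what tags reference) is just one possible choice.

```python
def repository_size(local_blobs):
    """Size of a repository is the sum of its own blobs, nothing else."""
    return sum(len(content) for content in local_blobs.values())


def collect_repository(local_blobs, manifests, tags):
    """Remove repository-local blobs that no tagged manifest references.

    local_blobs: digest -> content (the repository's _blobs directory)
    manifests:   manifest digest -> list of layer digests
    tags:        tag name -> manifest digest
    Blobs in global storage are never touched, which keeps this
    backward compatible with data pushed before the change.
    """
    keep = set()
    for manifest_digest in tags.values():
        keep.add(manifest_digest)
        keep.update(manifests.get(manifest_digest, []))
    removed = sorted(d for d in local_blobs if d not in keep)
    for digest in removed:
        del local_blobs[digest]
    return removed
```

The data set traversed is bounded by the repository itself, so such a call could be scheduled per repository rather than for the whole registry.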

5.2b. Build link database

The second idea is to extend the graph in the other direction. That means we would annotate each newly pushed blob with information about where it is referenced. The same would apply to manifest revisions: we would store where they are used. Since we would have the graph in both directions, we could quickly traverse it without parsing data and rebuilding the graph in memory.

Workflow:

When we link a blob to a repository, we would create a link in the other direction:

create ./docker/registry/v2/blobs/sha256/064c7eacdb01e4925c8a77b74fb95e2bb78782971c9a701c686dcc50979208bf/_repositories/image/link

When we overwrite a tag, we would create or remove the manifest revision link and the blob storage link:

create ./docker/registry/v2/blobs/sha256/064c7eacdb01e4925c8a77b74fb95e2bb78782971c9a701c686dcc50979208bf/_repositories/image/link
create ./docker/registry/v2/repositories/image/_manifests/revisions/sha256/064c7eacdb01e4925c8a77b74fb95e2bb78782971c9a701c686dcc50979208bf/_tags/latest/link
remove ./docker/registry/v2/repositories/image/_manifests/revisions/sha256/previous-manifest-attached-to-sha/_tags/latest/link

We would then make an API call on the registry repository to garbage collect it.

Pros:
- We keep storing blobs globally, which makes storage usage the most efficient,
- When proper ordering is used we don't need locking,
- Only new global blobs would be considered for recycling: when pushing a new blob we would mark it with blobs/sha256/<sha>/linkable (a file),
- We would build a database of references in the other direction,
- We would quickly know whether a blob can be removed by looking at the list of referencing repositories: if we remove a blob's link and the list is then empty, we are free to remove the blob,
- It is HA safe,
- We would expose API to perform fast garbage collection on repository. Since this garbage collection would be local to the repository and aided with links information we could confidently remove global blobs.
Cons:
- More complex, as we effectively need to hook into a number of places (creation of blobs, creation/update of tags) to build the database,
- More prone to a buggy implementation, as this works on global storage and a bug could break other projects,
- Since we would not always be able to remove a blob (empty list, but recently accessed by some Docker engine), we would have to periodically go through the list of all blobs to find the ones that have no referencing repositories attached.
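The core of the link database is a reverse index from blob to referencing repositories. The sketch below keeps it in memory; the `LinkDatabase` class and its methods are hypothetical, whereas the proposal stores the same information as link files next to each blob.

```python
class LinkDatabase:
    """Reverse index: blob digest -> repositories that reference it."""

    def __init__(self):
        self._referrers = {}

    def link(self, digest, repository):
        """Record the reverse link when a blob is linked into a repository."""
        self._referrers.setdefault(digest, set()).add(repository)

    def unlink(self, digest, repository):
        """Drop the reverse link. Returns True when no repository
        references the blob anymore, so it is a candidate for removal."""
        referrers = self._referrers.get(digest, set())
        referrers.discard(repository)
        if not referrers:
            self._referrers.pop(digest, None)
            return True
        return False

    def orphans(self, all_digests):
        """Periodic scan: blobs with no referencing repository attached."""
        return sorted(d for d in all_digests if d not in self._referrers)
```

Deciding whether a blob can be removed becomes a lookup instead of a full traversal, and `orphans` corresponds to the periodic pass over all blobs mentioned in the last drawback.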

5.3. Re-implement Registry in GitLab Rails

This builds on the GitLab monolith, but re-implementing all registry functions in GitLab might make data removal much simpler than extending the Registry source code.

It seems that the pull API is fairly straightforward, and it was implemented recently by @dzaporozhets for the dependency proxy for container images.

The hardest part would be implementing the API for pushing data to the registry. This seems to be the biggest challenge, but since we would fully control the data, the following would then be much easier to implement:

- Geo replication: we would replicate blobs?
- Data removal: we would remove blobs that are no longer referenced by tags,
- Policies for data removal: this is aligned with all other solutions that we would implement for 5.1 and 5.2,
- Security scanning: we would annotate blobs when the scan was done?

I think this is worth trying out to see how hard the upload process is.

6. Summary

This is the first iteration of this write-up. I have not considered all possibilities yet, but this should give you a general idea of the challenge of removing data and the possible approaches. In any case, this requires forking Docker Distribution and adding an additional annotation mechanism and an API. Local blob storage seems way simpler and is likely good enough for us. Running this small GC would be reasonably fast.

Technically, in either case, we could build an additional tool that, for case 1 (local blob storage, 5.2a), would move blobs from global to local storage, or, for case 2 (link database, 5.2b), would rebuild the link database. Either of those would allow us to recycle all existing space with moderate impact on production.

Let's consider doing that for case 1:

Iterate each repository,
Read a list of layers and manifests,
Copy object from global storage to local storage,
Turn on feature flag to disable registry accessing global blob storage,
Monitor and, if there are no problems, remove the migrated blobs from global storage.
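The first three steps (iterate, read references, copy) could be sketched as follows. This is an in-memory illustration with a hypothetical function name; the feature flag and the final cleanup are deliberately left out.

```python
def migrate_repository(global_blobs, local_blobs, referenced_digests):
    """Copy every blob a repository references from global to local storage.

    global_blobs:       digest -> content (global blob storage)
    local_blobs:        digest -> content (this repository's local storage)
    referenced_digests: layers and manifests read from the repository
    Returns the digests that were copied. Nothing is deleted here, so the
    step is safe to repeat and global storage stays intact.
    """
    copied = []
    for digest in referenced_digests:
        if digest in local_blobs:
            continue  # already migrated by an earlier run
        local_blobs[digest] = global_blobs[digest]
        copied.append(digest)
    return copied
```

Because the copy is idempotent, the tool can be interrupted and re-run per repository without coordination with the running registry.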

Case 2 is slightly harder, but doable.

My ❤ is with case 1. Even though it is slightly less efficient than case 2, its simplicity and the fact that it turns this into a "local" problem make it the easiest to implement. The simplicity of building a tool to recycle existing storage also tells me that this is the best and simplest solution to execute. Thirdly, we could run targeted GC only on our own repositories to ensure that it works properly before enabling it for everyone.

7. Links / references

TBD

Edited Jun 13, 2019 by Kamil Trzciński