
Bulk delete manifests during garbage collection (S3)

Context

This change is part of !24 and related to #10.

Like blob deletion, which we improved in !31 (merged), the current manifest deletion process for the S3 storage backend is very inefficient.

As with blobs, each manifest revision and each tag reference is stored in a single file. For blobs this file is named data; for manifest revisions and tag references it's named link.

Sample manifest revision full path (hex digest truncated):

docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/revisions/sha256/82d...057/link

Sample manifest tag reference full path:

docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/tags/index/sha256/82d...057/link

You can read more about the path specification here.

Current Behaviour

In the current implementation, the storage.driver.s3.Delete method is invoked with the path to the parent "folder" of the manifest revision and tag reference(s) (.../82d...057). This leads to a ListObjects request on that path, which only contains a single object, the link file, followed by a DeleteObjects request to delete that file. This is true for both manifest revisions and tag references, as they're currently deleted in separate Delete calls.

The storage.driver.s3.Delete method is the same one used for blob deletion (before !31 (merged)). It's a generic method designed to support both direct (single file) and recursive (multiple files) deletion, which is why it always lists the objects in a given path before actually deleting them.
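For illustration, here is a minimal sketch of that list-then-delete pattern, assuming the aws-sdk-go S3 API. The bucket name and the deletePath helper are placeholders, not the driver's actual code:

```go
// Minimal sketch of the generic list-then-delete pattern described above.
// The bucket name and helper name are placeholders; this is not the driver's code.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// deletePath lists every object under path and then deletes what it found,
// which costs two requests even when the path holds a single `link` file.
func deletePath(svc *s3.S3, bucket, path string) error {
	// Request 1: ListObjects on the parent "folder".
	listed, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(path),
	})
	if err != nil {
		return err
	}
	if len(listed.Contents) == 0 {
		return nil
	}

	objects := make([]*s3.ObjectIdentifier, 0, len(listed.Contents))
	for _, obj := range listed.Contents {
		objects = append(objects, &s3.ObjectIdentifier{Key: obj.Key})
	}

	// Request 2: DeleteObjects for the objects that were just listed.
	_, err = svc.DeleteObjects(&s3.DeleteObjectsInput{
		Bucket: aws.String(bucket),
		Delete: &s3.Delete{Objects: objects},
	})
	return err
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	err := deletePath(svc, "my-bucket",
		"docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/revisions/sha256/82d...057")
	if err != nil {
		log.Fatal(err)
	}
}
```

For a manifest path, the listing always returns exactly one object, the link file, so the first request buys no information we don't already have.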

Considering the above, for each manifest to delete, the garbage collector (GC) always sends two sequential HTTP requests to S3: one ListObjects followed by one DeleteObjects to delete the manifest revision. That's 2M network requests, where M is the number of manifests to delete.

On top of that, if there are tag references for a given manifest, another three sequential HTTP requests are sent to S3 per tag reference: one ListObjects to check whether the tag reference exists (stat), followed by a Delete call, which, as described above, involves one ListObjects and one DeleteObjects. That's 3T additional requests per manifest, or 3TM in total, where T is the number of tag references per manifest.

The worst-case scenario can, therefore, be described as 2M + 3TM requests to delete M manifests, which is very inefficient.

Rationale

Most of the principles described here were applied to blob deletion as well in !31 (merged). These changes also rely on the concurrency feature introduced in !32 (merged).

  • Considering that each manifest revision and tag reference is stored in a single file, named link, there is no need to list the contents of its parent folder before deleting it. We can delete the link file directly, cutting the required network requests in half, from 2N to N, where N is the number of link files to delete (manifest revisions or tag references).

  • All blob storage backends (S3, GCS, Azure, OSS and Swift) behave in the same way: empty "folders" are automatically deleted. Therefore, it's enough to remove the link file; there is no need to worry about its parent folder. This is actually how the existing Delete method of each storage driver works (S3 example); they list and delete the files within a path and do nothing about the path itself. However, the same is not valid for the local filesystem storage backend (see Caveats).

  • We're not making use of the fact that a single S3 DeleteObjects request can delete up to 1000 objects. Doing so would take us from N to ceil(N/1000) requests, which is a massive improvement.

  • In addition to the above, we can leverage goroutines and issue these DeleteObjects requests concurrently, one per batch of up to 1000 link files to delete (see the sketch after this list).

  • Instead of deleting the tag reference and revision link files in separate requests, we can remove them in the same DeleteObjects request. With this change, instead of M*T delete requests we can do just M+T.

  • Last but not least, we can probably skip the stat check (a ListObjects request) performed on each tag reference before attempting to delete it. For S3, deleting a file that doesn't exist succeeds anyway; however, this is not the case for all supported storage backends (trying to remove a non-existing file may raise an error). Regardless, I'm unable to see why we need to stat a tag reference before trying to delete it; this is not done for manifest revisions or blob files, only for tag references. If we do skip this operation, the worst that can happen is that an attempt to delete a non-existing tag reference results in an error (not with S3, but with other backends) and that error halts the GC. So it might be worth a try, at least during testing against a real repository. For now, we'll leave the stat call as-is; further investigation will be performed in #16 (closed).
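To make the batching and concurrency concrete, below is a rough sketch of what a bulk delete helper could look like, assuming aws-sdk-go and an errgroup. The function name deleteFiles echoes the new DeleteFiles driver method mentioned later, but the body is only an illustration of the idea, not the MR's actual implementation:

```go
// Illustrative sketch only: batch the link file keys into groups of up to
// 1000 and issue one DeleteObjects request per batch, concurrently.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"golang.org/x/sync/errgroup"
)

const maxKeysPerRequest = 1000 // S3 DeleteObjects hard limit

func deleteFiles(svc *s3.S3, bucket string, keys []string) error {
	var g errgroup.Group
	for start := 0; start < len(keys); start += maxKeysPerRequest {
		end := start + maxKeysPerRequest
		if end > len(keys) {
			end = len(keys)
		}
		batch := keys[start:end] // new slice per iteration, safe to capture below

		g.Go(func() error {
			objects := make([]*s3.ObjectIdentifier, 0, len(batch))
			for _, k := range batch {
				objects = append(objects, &s3.ObjectIdentifier{Key: aws.String(k)})
			}
			// One bulk DeleteObjects request per batch of up to 1000 keys.
			_, err := svc.DeleteObjects(&s3.DeleteObjectsInput{
				Bucket: aws.String(bucket),
				Delete: &s3.Delete{Objects: objects},
			})
			return err
		})
	}
	// Wait for all batches; the first error (if any) is returned.
	return g.Wait()
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	// Revision and tag reference link files can go into the same call.
	links := []string{
		"docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/revisions/sha256/82d...057/link",
		"docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/tags/index/sha256/82d...057/link",
	}
	if err := deleteFiles(svc, "my-bucket", links); err != nil {
		log.Fatal(err)
	}
}
```

Note that a DeleteObjects response can also report per-key failures in its Errors field, which a real implementation would need to surface; that is omitted here for brevity.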

Expectations

Number of Requests

The current solution requires up to 2M + 3TM requests to delete M manifests with T tag references each. As an example, to delete 100 manifests with 2 tags each, we need 2*100 + 3*2*100 = 800 requests.

Theoretically, with the changes proposed in this MR, we can achieve a worst-case scenario of ceil(MT/1000) if we ignore the stat call or ceil(MT/1000) + MT if we keep it. Therefore, for our example scenario, we would only need ceil(100*2/1000) = 1 or ceil(100*2/1000) + 100*2 = 201 requests, respectively.

It's important to note that this doesn't only provide better performance. With S3, DeleteObjects and ListObjects requests have a price per unit. So cutting requests is also beneficial in terms of monetary cost.

Performance

We cannot get a performance improvement as big as the reduction in network requests. There is an inherent cost to every bulk delete request. S3 needs to delete all objects before returning a response, so we're "just" (understatement) avoiding the cost of building/parsing, serializing/deserializing and sending/receiving all those individual requests.

However, S3 should be able to parallelize the deletion of the objects within a bulk delete request on the server-side, so a bulk request should always be faster than multiple single sequential requests, even without accounting for the network trip time.

Overall, when compared to the current solution, we expect a substantial performance improvement. The gain should grow along with the number of manifests to delete, benefiting from the reduction in required network requests and from concurrency.

Caveat

No benefit comes without a cost. For all supported blob storage backends (S3, GCS, Azure, OSS and Swift), empty "folders" are automatically "deleted" (in reality, they don't exist; they're just an abstraction, a grouping key). Therefore we can delete the link files directly and not worry about leaving empty folders (.../82d...057) behind.

This has been confirmed by testing against real buckets on all storage backends. A test repository with a manifest and a tag reference was created in every backend. The filesystem tree was captured before and after deleting the manifests and tags. The before vs after can be observed in !39 (6e042464), left side vs right side, respectively. As can be seen, no empty folders were left behind. The same test was performed using the current version of the registry, making sure that the end result is the same.

However, the above is not valid for the local filesystem and the in-memory (for test purposes only) storage backends. For the local filesystem, when the Delete method is invoked with the path of the parent folder for both manifest revisions and tag references (.../_manifests/.../sha256/82d...057), the folder (82d...057) is removed along with all its contents.

Fortunately, I have not yet observed any issues related to the empty folders left behind while using the local filesystem backend (besides having those empty folders around, which take around 4KB of space). Furthermore, in the current implementation, when deleting a blob, although the parent folder of its data file is removed (e.g. docker/registry/v2/blobs/sha256/f1/f13...7e6), the parent folder above it (f1) is not, which means that this caveat is already being observed in a different context.

Regardless, I have raised #14 (closed) to track this and possibly implement a workaround. A viable solution is to delete the parent folder of a given link/data file (if empty) after deleting the file itself. This can be achieved by customizing the new DeleteFiles method for this specific driver, without impacting any other functionality beyond GC. As expected, this will increase the execution time when using this storage backend, but given the massive improvements that we're able to achieve for all blob storage backends, I believe this is an acceptable trade-off.
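As a rough illustration of that workaround (hypothetical, not the actual code planned for #14), the filesystem driver's file deletion could prune the now-empty parent directory, using only the standard library:

```go
// Hypothetical sketch of the workaround discussed above for the local
// filesystem driver: remove the link/data file, then remove its parent
// directory if, and only if, it is now empty.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func deleteFileAndPruneParent(path string) error {
	if err := os.Remove(path); err != nil {
		return err
	}

	parent := filepath.Dir(path)
	entries, err := os.ReadDir(parent)
	if err != nil {
		return err
	}
	if len(entries) == 0 {
		// Mimic the blob storage backends, where an empty "folder"
		// simply ceases to exist once its last object is deleted.
		return os.Remove(parent)
	}
	return nil
}

func main() {
	// Throwaway temp tree standing in for .../_manifests/revisions/sha256/<digest>/link.
	root, _ := os.MkdirTemp("", "gc-sketch")
	revDir := filepath.Join(root, "revdigest")
	_ = os.MkdirAll(revDir, 0o755)
	link := filepath.Join(revDir, "link")
	_ = os.WriteFile(link, []byte("sha256:..."), 0o644)

	fmt.Println(deleteFileAndPruneParent(link)) // prints <nil>; revDir is removed too
}
```

This would keep the change scoped to the driver's DeleteFiles path, as suggested above, at the cost of one extra directory read per deleted file.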

Test Results

We ran a series of benchmarks to validate the expected improvements, both in terms of the number of network requests and the overall performance.

Methodology

Matrix

The following test matrix was used and allows us to simulate the deletion of M manifests with T tag references each:

| # manifests (M) | # tags per manifest (T) |
|---|---|
| 1 | 0 |
| 1 | 1 |
| 10 | 0 |
| 10 | 1 |
| 100 | 0 |
| 100 | 1 |
| 100 | 20 |

The "100 manifests with 20 tag references each" test is intended to show the advantage of relying on concurrent DeleteObjects requests. As explained before, each DeleteObjects request can include up to 1000 objects to delete. With 100 manifests and 20 tag references each, we have precisely 2000 link files to remove, which, leveraging on !32 (merged), should spawn two goroutines, one per DeleteObjects request.

Scenarios

  • Current implementation, using 34dc7cbe (tip of the release-2.7-gitlab branch at the time of writing);
  • This MR, keeping the stat call for every manifest tag reference;
  • This MR, without the stat call (after !39 (d60b54cf)).

Notes

  • For each execution we recorded the number of requests and the elapsed time (ns/op);
  • For each manifest, the benchmark creates and uploads a random test image (~40MB) to S3. Therefore we decided to keep the number of manifests small, with a maximum of 100 (4GB in total);
  • Each benchmark was executed exactly once, due to the constraint described above.

Setup

Benchmarks ran against a Standard S3 bucket in the eu-central-1 region. The test registry was kept on a local development machine. The measured latency between the local machine and the AWS region was 48ms.

Benchmarks were executed using the following command:

go test -v github.com/docker/distribution/registry/storage/driver/s3-aws -args -check.b -check.f "DriverSuite.BenchmarkRemoveManifest*" -gocheck.btime 1ns

Results

Number of Requests

| # of manifests (M) | # of tags per manifest (T) | Current version (# of requests) | This MR w/stat (# of requests) | This MR w/stat (# times fewer requests) | This MR w/o stat (# of requests) | This MR w/o stat (# times fewer requests) |
|---|---|---|---|---|---|---|
| 1 | 0 | 2 | 1 | 2x | 1 | 2x |
| 1 | 1 | 5 | 2 | 3x | 1 | 5x |
| 10 | 0 | 20 | 1 | 20x | 1 | 20x |
| 10 | 1 | 50 | 2 | 25x | 1 | 50x |
| 100 | 0 | 200 | 1 | 200x | 1 | 200x |
| 100 | 1 | 500 | 2 | 250x | 1 | 500x |
| 100 | 20 | 6200 | 22 | 282x | 2 | 3100x |

As observed in the table above, the reduction in the number of required network requests is substantial. The results are aligned with the worst-case scenarios of 2M + 3TM requests for the current behaviour, and ceil(MT/1000) (ignoring the stat call) or ceil(MT/1000) + MT (keeping the stat call) for the newly proposed behaviours.

We should keep in mind that every S3 request comes with a monetary cost (except DELETE requests, which we don't use, as DeleteObjects operations are POST requests).

Performance

| # of manifests (M) | # of tags per manifest (T) | Current version (ns/op) | This MR w/stat (ns/op) | This MR w/stat (# times faster) | This MR w/o stat (ns/op) | This MR w/o stat (# times faster) |
|---|---|---|---|---|---|---|
| 1 | 0 | 606153574 | 574414880 | 1x | 569338177 | 1x |
| 1 | 1 | 966410198 | 613353498 | 2x | 582084552 | 2x |
| 10 | 0 | 3666298828 | 1466329110 | 3x | 1439528993 | 3x |
| 10 | 1 | 8657222064 | 2207152786 | 4x | 1760289573 | 5x |
| 100 | 0 | 28150195388 | 1019283291 | 28x | 1017087339 | 28x |
| 100 | 1 | 58692186549 | 9271314923 | 6x | 1270475980 | 46x |
| 100 | 20 | 665373618970 | 125509286474 | 5x | 3318149515 | 201x |

These results show that, for our test scenarios, the run time can be up to 201 times faster than the current implementation. The improvement ratio should continue to grow as M and T increase.

As expected, the two possible implementations perform similarly when T=0 and diverge as T grows. The benefit of concurrent requests is also clear.

Conclusions

The benchmarks seem to confirm the expectations regarding the number of requests and the overall speedup. As expected, nothing comes without a cost. The foreseeable caveats have been discussed and can be addressed to mitigate concerns.

Note: Further testing against a real large repository is required to validate the feasibility and effectiveness of this change. None of the approaches and results highlighted here are guaranteed until then. The final details will be described later in !24 (merged).

Next Steps

  • Investigate the possibility of avoiding the stat operation for each tag (#16 (closed)).
  • Implement a workaround for the empty folders left behind on the local filesystem storage driver (#14 (closed)).

These will be addressed in separate MRs.
