# Bulk delete manifests during garbage collection (S3)
## Context

This change is part of !24 and related to #10.
Like the blob deletion process, which we improved in !31 (merged), the current manifest deletion process for the S3 storage backend is very inefficient.

As with blobs, each manifest revision and tag reference is stored in a single file. While for blobs this file is named `data`, for manifest revisions and tag references it's named `link`.

Sample manifest revision full path (hex digest truncated):
`docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/revisions/sha256/82d...057/link`

Sample manifest tag reference full path:
`docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/tags/index/sha256/82d...057/link`

You can read more about the path specification here.
## Current Behaviour

In the current implementation, the `storage.driver.s3.Delete` method is invoked with the path to the parent "folder" of the manifest revision and tag reference(s) (`.../82d...057`). This leads to a `ListObjects` request on that path, which contains only a single object, the `link` file, followed by a `DeleteObjects` request to delete that file. This is true for both manifest revisions and tag references, as they're currently deleted in separate `Delete` calls.
The `storage.driver.s3.Delete` method is the same one used for blob deletion (before !31 (merged)). It's a generic method designed to support both direct (single file) and recursive (multiple files) deletion, which is why it always lists the objects in a given path before actually deleting them.

Considering the above, for each manifest to delete, the garbage collector (GC) always sends two sequential HTTP requests to S3: one `ListObjects` followed by one `DeleteObjects` to delete the manifest revision. That's `2M` network requests, where `M` is the number of manifests to delete.
On top of that, if there are tag references related to a given manifest, another three sequential HTTP requests are sent to S3 per tag reference: one `ListObjects` to check if the tag reference exists (stat), followed by a `Delete` call, which, as described above, involves one `ListObjects` and one `DeleteObjects`. That's `3T` additional requests per manifest, where `T` is the number of tag references for a given manifest, or `3TM` in total.

The worst-case scenario can, therefore, be described as `2M + 3TM` requests in total, which is very inefficient.
## Rationale

Most of the principles described here were also applied to blob deletion in !31 (merged). These changes also rely on the concurrency feature introduced in !32 (merged).
- Considering that manifest revisions and tag references are stored in a single file, named `link`, there is no need to list the contents of its parent folder before deleting it. We can delete the `link` file directly, cutting the required network requests in half, from `2N` to `N`, where `N` is the number of files to delete (either manifest revisions or tag references).
- All blob storage backends (S3, GCS, Azure, OSS and Swift) behave in the same way: empty "folders" are automatically deleted. Therefore, it's enough to remove the `link` file; there is no need to worry about its parent folder. This is actually how the existing `Delete` method of each storage driver works (S3 example): they list and delete the files within a path and do nothing about the path itself. However, the same is not valid for the local filesystem storage backend (see Caveats).
- We're not making use of the bulk delete feature of the S3 `DeleteObjects` operation, which supports up to 1000 objects per request. Doing so, we would be able to go from `N` to `ceil(N/1000)` requests, which is a massive improvement.
- In addition to the above, we can leverage goroutines and spawn `DeleteObjects` requests concurrently, one for every 1000 `link` files to delete.
- Instead of deleting the tag reference and revision link files in separate requests, we can remove them in the same `DeleteObjects` request. With this change, instead of `M*T` delete requests we can do just `M+T`.
- Last but not least, we can probably skip the stat check (a `ListObjects` request) performed over each tag reference before attempting to delete it. For S3, if we try to delete a file that doesn't exist, the request will succeed anyway. However, this is not the case for all supported storage backends (trying to remove a non-existing file may raise an error). Regardless, I'm unable to see why we need to stat a tag reference before trying to delete it; this is not done for manifest revisions or blob files, only for tag references. If we do skip this operation, the worst that can happen is that an attempt to delete a non-existing tag reference results in an error (not with S3, but with others), and that error will halt the GC. So it might be worth a try, at least during testing against a real repository. For now, we'll leave the stat call as-is. Further investigation will be performed in #16 (closed).
## Expectations

### Number of Requests
The current solution requires up to `2M + 3TM` requests to delete `M` manifests with `T` tag references each. As an example, to delete 100 manifests with 2 tags each, we need `2*100 + 3*2*100 = 800` requests.

Theoretically, with the changes proposed in this MR, we can achieve a worst-case scenario of `ceil(MT/1000)` requests if we ignore the stat call, or `ceil(MT/1000) + MT` if we keep it. Therefore, for our example scenario, we would only need `ceil(100*2/1000) = 1` or `ceil(100*2/1000) + 100*2 = 201` requests, respectively.
It's important to note that this doesn't only provide better performance. With S3, `DeleteObjects` and `ListObjects` requests have a price per unit, so cutting requests is also beneficial in terms of monetary cost.
### Performance

We cannot expect a performance improvement as big as the reduction in network requests. There is an inherent cost to every bulk delete request: S3 needs to delete all objects before returning a response, so we're "just" (understatement) avoiding the cost of building/parsing, serializing/deserializing and sending/receiving all those individual requests.

However, S3 should be able to parallelize the deletion of the objects within a bulk delete request on the server side, so a bulk request should always be faster than multiple single sequential requests, even without accounting for network round-trip time.

Overall, when compared to the current solution, we expect the performance gain to grow with the number of manifests to delete, benefiting from both the reduction in required network requests and the added concurrency.
## Caveats

No benefit comes without a cost. For all supported blob storage backends (S3, GCS, Azure, OSS and Swift), empty "folders" are automatically "deleted" (in reality, they don't exist; they're just an abstraction, a grouping key). Therefore, we can delete the `link` files directly and not worry about leaving empty folders (`.../82d...057`) behind.
This has been confirmed by testing against real buckets on all storage backends. A test repository with a manifest and a tag reference was created in every backend. The filesystem tree was captured before and after deleting the manifests and tags. The before vs after can be observed in !39 (6e042464), left side vs right side, respectively. As can be seen, no empty folders were left behind. The same test was performed using the current version of the registry, making sure that the end result is the same.
However, the above is not valid for the local filesystem and the in-memory (for test purposes only) storage backends. For the local filesystem, when the `Delete` method is invoked with the path of the parent folder for both manifest revisions and tag references (`.../_manifests/.../sha256/82d...057`), the folder (`82d...057`) is removed along with all its contents.
Fortunately, I have not yet observed any issues related to the empty folders left behind when using the local filesystem backend (besides having those empty folders around, each taking about 4KB of space). Furthermore, in the current implementation, when deleting a blob, although the parent folder of its `data` file is removed (e.g. `docker/registry/v2/blobs/sha256/f1/f13...7e6`), the parent folder above it (`f1`) is not, which means this caveat is already present in a different context.
Regardless, I have raised #14 (closed) to track this and possibly implement a workaround. A viable solution is to delete the parent folder of a given `link`/`data` file (if empty) after deleting the file itself. This can be achieved by customizing the new `DeleteFiles` method for this specific driver, without impacting any other functionality beyond GC. As expected, this will increase the execution time when using this storage backend, but given the massive improvements that we're able to achieve for all blob storage backends, I believe this is an acceptable trade-off.
## Test Results

We ran a series of benchmarks to validate the expected improvements, both in terms of the number of network requests and overall performance.

### Methodology

#### Matrix

The following test matrix was used, allowing us to simulate the deletion of `M` manifests with `T` tag references each:
| # manifests (`M`) | # tags per manifest (`T`) |
|---|---|
| 1 | 0 |
| 1 | 1 |
| 10 | 0 |
| 10 | 1 |
| 100 | 0 |
| 100 | 1 |
| 100 | 20 |
The "100 manifests with 20 tag references each" test is intended to show the advantage of relying on concurrent `DeleteObjects` requests. As explained before, each `DeleteObjects` request can include up to 1000 objects to delete. With 100 manifests and 20 tag references each, we have precisely 2000 `link` files to remove, which, leveraging !32 (merged), should spawn two goroutines, one per `DeleteObjects` request.
#### Scenarios

- Current implementation, using 34dc7cbe (tip of the `release-2.7-gitlab` branch at the time of writing);
- This MR, keeping the stat call for every manifest tag reference;
- This MR, without the stat call (after !39 (d60b54cf)).
#### Notes

- For each execution we recorded the number of requests and the elapsed time (ns/op);
- For each manifest, the benchmark creates and uploads a random test image (~40MB) to S3. We therefore kept the number of manifests small, with a maximum of 100 (4GB in total);
- Each benchmark was executed exactly once, due to the constraint described above.
#### Setup

Benchmarks ran against a Standard S3 bucket in the `eu-central-1` region. The test registry was kept on a local development machine. The measured latency between the local machine and the AWS region was 48ms.

Benchmarks were executed using the following command:

`go test -v github.com/docker/distribution/registry/storage/driver/s3-aws -args -check.b -check.f "DriverSuite.BenchmarkRemoveManifest*" -gocheck.btime 1ns`
### Results

#### Number of Requests

| # of manifests (`M`) | # of tags per manifest (`T`) | Current version (# of requests) | This MR w/ stat (# of requests) | This MR w/ stat (times fewer) | This MR w/o stat (# of requests) | This MR w/o stat (times fewer) |
|---|---|---|---|---|---|---|
| 1 | 0 | 2 | 1 | 2x | 1 | 2x |
| 1 | 1 | 5 | 2 | 3x | 1 | 5x |
| 10 | 0 | 20 | 1 | 20x | 1 | 20x |
| 10 | 1 | 50 | 2 | 25x | 1 | 50x |
| 100 | 0 | 200 | 1 | 200x | 1 | 200x |
| 100 | 1 | 500 | 2 | 250x | 1 | 500x |
| 100 | 20 | 6200 | 22 | 282x | 2 | 3100x |
As observed in the table above, the reduction in the number of required network requests is substantial. The results are aligned with the worst-case scenario of `2M + 3TM` requests for the current behaviour, and `ceil(MT/1000)` (ignoring the stat call) or `ceil(MT/1000) + MT` (keeping the stat call) for the newly proposed behaviours.
We should keep in mind that every S3 request (except `DELETE` requests, which we don't use, as `DeleteObjects` is a `POST`) comes with a monetary cost.
#### Performance

| # of manifests (`M`) | # of tags per manifest (`T`) | Current version (ns/op) | This MR w/ stat (ns/op) | This MR w/ stat (times faster) | This MR w/o stat (ns/op) | This MR w/o stat (times faster) |
|---|---|---|---|---|---|---|
| 1 | 0 | 606153574 | 574414880 | 1x | 569338177 | 1x |
| 1 | 1 | 966410198 | 613353498 | 2x | 582084552 | 2x |
| 10 | 0 | 3666298828 | 1466329110 | 3x | 1439528993 | 3x |
| 10 | 1 | 8657222064 | 2207152786 | 4x | 1760289573 | 5x |
| 100 | 0 | 28150195388 | 1019283291 | 28x | 1017087339 | 28x |
| 100 | 1 | 58692186549 | 9271314923 | 6x | 1270475980 | 46x |
| 100 | 20 | 665373618970 | 125509286474 | 5x | 3318149515 | 201x |
These results show that, for our test scenario, the run time can be up to 201 times faster than with the current implementation. The improvement ratio should continue to grow as `M` and `T` increase.

As expected, the two possible implementations perform similarly when `T=0` and diverge as `T` grows. The benefit of concurrent requests is also clear.
## Conclusions

The benchmarks seem to confirm the expectations regarding the number of requests and the overall speedup. As expected, nothing comes without a cost; the foreseeable caveats have been discussed and can be addressed to mitigate concerns.

Note: Further testing against a real, large repository is required to validate the feasibility and effectiveness of this change. None of the approaches and results highlighted here are guaranteed until then. The final details will be described later in !24 (merged).
## Next Steps

- Investigate the possibility of avoiding the stat operation for each tag (#16 (closed)).
- Implement a workaround for the empty folders left behind by the local filesystem storage driver (#14 (closed)).

These will be addressed in separate MRs.