Optimize the Container Registry garbage collection sweep stage for GCS
Problem to solve
For organizations that build and publish many Docker images to their GitLab Container Registry, it is vital that they can easily and efficiently delete old, unused images from storage. The problem is that the garbage collection algorithm is inefficient and can take a long time to run.
There are two key stages to the process: the mark stage, which identifies which images/tags can be deleted, and the sweep stage, which deletes them. We recently optimized both of these stages for S3 and saw significant improvements. We need to do the same for Google Cloud Storage (GCS), so that our customers utilizing GCS are no longer blocked from running garbage collection and lowering their storage costs.
Intended users
Further details
GCS Bulk Delete
- We intended to update GCS support to take advantage of bulk delete by adding batch operations to the upstream GCS client library (leveraging the JSON API batch support), but this is not feasible at the moment.
- Upon further investigation, we found that the Google team wants to evaluate and eventually roll out batch support for all official client libraries in a consistent way, likely at the same time.
- As an example, here are the feature requests for the Go and Ruby client libraries:
- Both of these point to the same centralized issue on Google's issue tracker: https://issuetracker.google.com/issues/142641783
- Therefore, we won't be able to take advantage of bulk delete requests for now. Our primary focus will be leveraging concurrency, as sketched below. The results won't be as dramatic as for S3, but I still expect significant improvements.
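Since each object must still be removed with its own DELETE request, the main lever left is overlapping those requests. Below is a minimal sketch of that approach, assuming the standard cloud.google.com/go/storage client and golang.org/x/sync/errgroup; the deleteObjects helper and its concurrency bound are illustrative, not the registry's actual code path:

```go
package sweep

import (
	"context"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
)

// deleteObjects issues up to maxConcurrency DELETE requests in parallel.
// Each object is still removed with an individual request (no bulk
// delete), but the requests overlap instead of running back to back.
func deleteObjects(ctx context.Context, bucket *storage.BucketHandle, names []string, maxConcurrency int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxConcurrency) // bound the number of in-flight requests

	for _, name := range names {
		name := name // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return bucket.Object(name).Delete(ctx)
		})
	}
	return g.Wait()
}
```

Bounding the number of in-flight requests keeps memory and socket usage predictable and avoids hammering GCS with an unbounded burst of deletes.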
Current Behavior
The GC sweep stage starts by looking for a list of manifests and blobs that should be deleted. The manifest list is built during the mark stage (via a recursive path walk), and the blob list is built similarly, but during the sweep stage.
With the lists of manifests and blobs to delete, the GC uses storage.Vacuum as an intermediary to the storage driver, removing objects from the backend.
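For context, this translates into a sequential pattern along the following lines: one backend call per object, so run time grows roughly linearly with the number of blobs. This is a simplified sketch; the Vacuum/RemoveBlob naming follows upstream docker/distribution, and the GitLab fork's signatures may differ:

```go
package sweep

// BlobRemover abstracts the single call we care about; in the registry
// this role is played by storage.Vacuum's RemoveBlob (upstream
// docker/distribution naming; the GitLab fork may differ).
type BlobRemover interface {
	RemoveBlob(dgst string) error
}

// sweepBlobs removes blobs one at a time: one backend call per blob,
// executed sequentially, so run time grows linearly with len(blobs).
func sweepBlobs(v BlobRemover, blobs []string) error {
	for _, dgst := range blobs {
		if err := v.RemoveBlob(dgst); err != nil {
			return err
		}
	}
	return nil
}
```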
Proposal
Identify and implement performance optimizations for the sweep stage of the garbage collection algorithm for GCS.
- Since GitLab.com utilizes GCS for Container Registry storage, we can run performance tests and benchmarks on dev.gitlab.com. This will help inform how we can scale the changes to production GitLab.com; a rough test harness is sketched below.
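As an illustration of such a test run, the harness below lists staged blobs and times a concurrent sweep. The bucket name, prefix, and concurrency level are placeholders, and the deletion pattern mirrors the concurrency sketch above:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage client: %v", err)
	}
	defer client.Close()

	// Placeholder bucket and prefix for a staged test environment.
	bucket := client.Bucket("registry-gc-benchmark")
	prefix := "docker/registry/v2/blobs/"

	// List the objects staged for the test sweep.
	var names []string
	it := bucket.Objects(ctx, &storage.Query{Prefix: prefix})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("listing objects: %v", err)
		}
		names = append(names, attrs.Name)
	}

	// Time the sweep at a fixed concurrency level.
	start := time.Now()
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(100) // concurrency level under test; worth varying
	for _, name := range names {
		name := name // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return bucket.Object(name).Delete(gctx)
		})
	}
	if err := g.Wait(); err != nil {
		log.Fatalf("sweep: %v", err)
	}
	log.Printf("swept %d objects in %s", len(names), time.Since(start))
}
```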
Permissions and Security
- There are no permissions changes needed for this issue.
Documentation
- There are no documentation changes needed for this issue.
Availability & Testing
What does success look like, and how can we measure that?
Success looks like
- We see performance gains similar to the optimizations we made for S3, where GC for 15k blobs went from roughly 2 hours to 93 seconds.
- We enable our customers utilizing large amounts of storage on GCS to run GC and greatly reduce their storage costs.
- We apply what we learn to better understand how to reduce storage costs for GitLab.com.
Metrics
- #38052 breaks out the metrics we would like to track for understanding usage and adoption of garbage collection.
What is the type of buyer?
- This problem impacts our larger customers most, as they typically have many teams building many images.