Improved performance for the GitLab Container Registry garbage collection algorithm for S3
Problem to solve
The GitLab Container Registry allows developers to build, push and share Docker images/tags using the Docker client and/or GitLab CI/CD.
For organizations that build many images across many projects, it is important to regularly remove old, unused images and tags. However, the container registry garbage collection process is inefficient and takes a long time to run. This makes it impractical for instances with large amounts of registry storage to run the process at all, resulting in expensive, wasteful storage usage.
- Self-managed customers can schedule downtime for the registry in order to run garbage collection, but typically not for more than 4 hours.
Optimize the container registry garbage collection process to run faster, enabling our customers to clean up their registry and lower their cost of storage.
After an initial investigation, we learned that the path to optimization will likely be different for each storage system. So, we will focus our initial effort on S3, then GCS and beyond.
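One reason the path is storage-specific: S3's DeleteObjects API accepts up to 1,000 keys per request, so deleting unreferenced blobs in batches is far cheaper than issuing one DELETE call per object. The sketch below shows only the batching logic; `delete_batch` is a stand-in for a real S3 client call (e.g. boto3's `client.delete_objects`), not the registry's actual implementation.

```python
def chunk(keys, size=1000):
    """Split a list of object keys into batches of at most `size`
    (1,000 is the S3 DeleteObjects per-request limit)."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def delete_unreferenced(keys, delete_batch):
    """Delete keys in batches; `delete_batch` stands in for a real
    S3 DeleteObjects request. Returns the number of keys deleted."""
    deleted = 0
    for batch in chunk(keys):
        delete_batch(batch)  # one API round trip per batch of up to 1,000 keys
        deleted += len(batch)
    return deleted
```

With 500,000 unreferenced blobs, this is 500 API requests instead of 500,000, and the batches can also be issued in parallel.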
- As an administrator running garbage collection on a container registry that contains more than 500 GB of storage, I need the process to complete in less than four hours, so that I can reclaim storage without sacrificing the productivity of my organization's engineering teams.
- Optimizing the code will serve as a stepping stone to delivering online garbage collection and allow us to unblock some of our customers.
- Our large self-managed customers will be able to lower their cost of storage and improve the discoverability of their container registry.
- Small to mid-sized self-managed customers will be able to run garbage collection with less down time and a less disruptive process.
Optimization vs. enabling on-line garbage collection
- The garbage collection process currently requires the container registry to be set to read-only mode or shut down entirely.
- Online garbage collection, which requires no downtime, is the direction we need to go.
- However, now that we have forked Docker Registry, it makes sense to first optimize the existing code, which based on early feedback is rather inefficient. This should help unblock some of our larger customers from running garbage collection.
- Then we will enable online garbage collection, which will run in the background and prevent downtime entirely. Because that work requires more effort, we are prioritizing it second.
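For context on the trade-off above: the existing offline process is a mark-and-sweep. It walks every manifest to mark the blobs that are still referenced, then sweeps (deletes) everything unmarked. The read-only requirement exists because a push during the sweep could reference a blob that is about to be deleted. A minimal in-memory sketch of the idea (the real registry walks the storage backend, not Python dicts):

```python
def garbage_collect(manifests, blobs):
    """Mark-and-sweep over container registry storage.

    manifests: dict mapping manifest digest -> list of referenced blob digests
    blobs:     set of all blob digests currently in storage
    Returns the set of unreferenced blob digests that are safe to delete.
    """
    # Mark phase: collect every blob digest reachable from a manifest.
    marked = set()
    for referenced in manifests.values():
        marked.update(referenced)
    # Sweep phase: anything in storage but not marked is garbage.
    return blobs - marked
```

Optimizing the offline version means making the mark walk and the sweep deletes cheaper; the online version must additionally tolerate concurrent pushes during both phases.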
Permissions and Security
- As with the existing garbage collection command, this should be limited to administrators only.
What does success look like, and how can we measure that?
- Success is a 50% reduction in the time the garbage collection process takes for our customers. We can measure this by working with the organizations that have requested this optimization and evaluating whether it meets their needs.
- Track the number of times the command is run and how long each run takes.
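Capturing those two numbers only needs a thin wrapper around the garbage collection command. The names below are illustrative placeholders, not the registry's actual instrumentation:

```python
import time

class GCStats:
    """Accumulates how many times garbage collection ran and how long
    each run took. A placeholder sketch, not real registry code."""

    def __init__(self):
        self.runs = 0
        self.durations = []  # seconds per run

    def timed(self, gc_fn, *args, **kwargs):
        """Run `gc_fn`, recording its wall-clock duration."""
        start = time.monotonic()
        result = gc_fn(*args, **kwargs)
        self.durations.append(time.monotonic() - start)
        self.runs += 1
        return result
```

Reporting these per-run durations is also what lets us verify the 50% improvement target above against real customer workloads.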