Optimize the mark stage of Container Registry garbage collection for GCS
Problem to solve
For organizations that build and publish many Docker images to their GitLab Container Registry, it is vital that they can easily and efficiently delete old, unused images from storage. The problem is that the garbage collection algorithm is inefficient and can take a long time to run.
There are two key stages to the process: the **mark** stage, which identifies which images/tags can be deleted, and the **sweep** stage, which deletes them. We recently optimized both of these stages for S3 and saw significant improvements. We need to do the same for Google Cloud Storage (GCS), so that customers using GCS are no longer blocked from running garbage collection and lowering their storage costs.
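For context, here is a minimal sketch of how the two stages relate, in Go (the language of the Container Registry). The `Manifest` type and the in-memory data are illustrative stand-ins for the registry's storage driver calls, not its real API:

```go
package main

import "fmt"

// Manifest references the blobs (layers and config) that make up an image.
type Manifest struct{ BlobDigests []string }

// markAndSweep illustrates the two GC stages over in-memory data.
func markAndSweep(manifests []Manifest, blobs []string, deleteBlob func(string)) {
	// Mark stage: walk every manifest still referenced by a repository
	// and record the digests of the blobs that must be kept.
	marked := map[string]bool{}
	for _, m := range manifests {
		for _, d := range m.BlobDigests {
			marked[d] = true
		}
	}
	// Sweep stage: delete every blob that was not marked.
	for _, d := range blobs {
		if !marked[d] {
			deleteBlob(d)
		}
	}
}

func main() {
	manifests := []Manifest{{BlobDigests: []string{"sha256:aaa"}}}
	blobs := []string{"sha256:aaa", "sha256:bbb"} // bbb is unreferenced
	markAndSweep(manifests, blobs, func(d string) { fmt.Println("deleting", d) })
}
```

The mark stage tends to dominate the runtime because it must enumerate every manifest and blob in the storage backend, which is why it is the focus of this issue.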
This issue will focus on improving the **mark** stage.
Proposal
Identify and implement performance optimizations for the mark stage of the garbage collection algorithm for GCS. One possible direction is sketched after the list below.
- Since GitLab.com utilizes GCS for Container Registry storage, we can run performance tests and benchmarks on dev.gitlab.com. This will help inform how to scale the optimizations to production GitLab.com.
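As one example of what such an optimization could look like, the sketch below lists blobs concurrently by digest prefix, mirroring the parallel-walk approach used for S3. It assumes the standard `docker/registry/v2/blobs/sha256/<2-hex-char shard>/...` layout; the bucket name, shard count, and concurrency limit are illustrative assumptions, not the registry's actual configuration:

```go
package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
	"google.golang.org/api/iterator"
)

// listPrefix collects the names of all objects under one prefix.
func listPrefix(ctx context.Context, bkt *storage.BucketHandle, prefix string) ([]string, error) {
	var names []string
	it := bkt.Objects(ctx, &storage.Query{Prefix: prefix})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, attrs.Name)
	}
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Hypothetical bucket name. Blobs are sharded by the first two hex
	// characters of their digest, giving 256 natural units of parallelism.
	bkt := client.Bucket("my-registry-bucket")

	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(32) // illustrative bound on concurrent list calls
	results := make([][]string, 256)
	for i := 0; i < 256; i++ {
		i := i // capture loop variable for the goroutine
		g.Go(func() error {
			prefix := fmt.Sprintf("docker/registry/v2/blobs/sha256/%02x/", i)
			names, err := listPrefix(gctx, bkt, prefix)
			results[i] = names
			return err
		})
	}
	if err := g.Wait(); err != nil {
		panic(err)
	}

	total := 0
	for _, shard := range results {
		total += len(shard)
	}
	fmt.Println("marked candidate blob count:", total)
}
```

In practice the mark stage needs manifest contents as well as blob names, but bounded concurrent listing is typically where most of the wall-clock time can be recovered; the right concurrency limit is something the dev.gitlab.com benchmarks would determine.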
Permissions and Security
- There are no permissions changes needed for this issue.
Documentation
- There are no documentation changes needed for this issue.
Availability & Testing
What does success look like, and how can we measure that?
Success looks like
- We see performance gains similar to the S3 optimizations, where GC for 15k blobs went from 2 hours to 93 seconds.
- We enable customers using large amounts of GCS storage to run GC and greatly reduce their storage costs.
- We apply what we learn to better understand how to reduce storage costs for GitLab.com.
- #38052 breaks out the metrics we would like to track for understanding usage and adoption of garbage collection.
What is the type of buyer?
- This problem impacts our larger customers most, as they typically have many teams building many images.