Caching in cleanup policy background jobs
⛲ Context
With time, the container registry accumulates container image tags, and those tags take up physical space on object storage.
To counter that, we implemented cleanup policies: users create a set of filters that allow the backend to distinguish between a tag that needs to be kept and a tag that can be destroyed.
Cleanup policies are executed periodically by background jobs. There is a cadence value that users can set but, to keep things simple, let's say that the backend runs the policies daily.
Now, container image tags don't live in the rails backend database. Those objects live directly in the container registry. So, when the backend runs a policy against a container image, it has to contact the container registry (through an API) to get all the information about its tags.
Let's detail these interactions (simplified):
```mermaid
sequenceDiagram
autonumber
rails cleanup tags service->>rails cleanup tags service: Run the cleanup policy on this container image X
rails cleanup tags service->>container registry: Give me all the tags of container image X
container registry->>rails cleanup tags service: Array of tag names
rails cleanup tags service->>rails cleanup tags service: Apply filters F1 on the array of tag names
loop **For each tag name**
rails cleanup tags service->>container registry: get the created at timestamp
container registry->>rails cleanup tags service: return the created at timestamp
end
rails cleanup tags service->>rails cleanup tags service: Apply filters F2 on tags with created at
rails cleanup tags service->>rails delete tags service: hey delete these tags!
rails delete tags service->>rails cleanup tags service: ok
```
- F1: this set of filters applies to tag names (such as a regex).
- F2: these filters need the `created_at` field, as they work on the tag list ordered by `created_at`.

You can see in the interactions above that F2 triggers a loop where the backend makes one API call per tag to get its `created_at`. These are the constraints we need to work with. At some point, container registry updates will unlock evolutions such as returning the list of tags with their names and their `created_at` fields in a single API call. Until we have those updates, we need to work with two API endpoints: one to get the list of tags (names only) and one to get the `created_at` of a single tag.
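To make the constraint concrete, here is a minimal Ruby sketch of the current flow. All names here (`CleanupTagsSketch`, `tag_names`, `created_at`) are illustrative stand-ins, not the actual service or registry client API:

```ruby
# Illustrative sketch only: `registry` is a stand-in for the real
# container registry API client (method names are assumptions).
class CleanupTagsSketch
  def initialize(registry)
    @registry = registry
  end

  # F1 works on names alone; F2 needs created_at, which currently
  # costs one extra API call per tag -- the loop we want to reduce.
  def tags_to_destroy(image, name_regex:, older_than:)
    names = @registry.tag_names(image)          # 1 API call: names only
    matching = names.grep(name_regex)           # F1: name-based filters
    tags = matching.map do |name|
      { name: name,
        created_at: @registry.created_at(image, name) } # N API calls
    end
    tags.select { |tag| tag[:created_at] < Time.now - older_than } # F2
  end
end
```

With 200 tags surviving F1, that is up to 201 registry calls for a single container image, even when F2 ends up selecting nothing.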
To complete this context, understand that container image tags are mutable: users can delete a tag and make it point to a different image. The constraint here is that they do this by interacting directly with the container registry. As such, the rails backend has no clue about what happened to a tag.
💥 Problem
The main issue is that F2 filters out so many tags that the resulting list of tags (i.e. the list of tags to destroy) is empty.

See this Kibana dashboard. We're not going into the details of that dashboard, but look at the `Deleting tags` section: its ratio relative to the rest is tiny. In other words, in the majority of job executions, we make all these API calls to the container registry for nothing.

This is not efficient at all.

Here is the p95 of `external_http_count` over the last 24 hours for jobs that didn't destroy any tag.
🔨 Solution
The proposed solution aims to reduce the loop of pings to get `created_at` values. This will make the background job more efficient, and so the backend will make better use of its background resources.

The idea is to have a cache for some tags (not all of them). We need to be extra cautious about what we cache because, as presented in the context, tags are mutable objects and the rails backend is not notified when a tag is "updated". In other words, caching a tag can be dangerous because we could easily end up with stale data in the cache.
Let's see if we can work around those limitations. Here are the updated steps around F2 in the `cleanup_tags_service.rb`:

1. Receive the result of F1 (array of tag names).
2. Read the `older_than` parameter from the cleanup policy.
3. Remove from the cache all entries older than `older_than`.
4. For each tag, check the cache and fill in the `created_at`.
5. For each tag without a `created_at`, ping the container registry.
6. Apply the F2 filters.
7. For each tag filtered out, create a new cache entry for the tag with its `created_at` if a cache entry doesn't already exist.
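The steps above could be sketched like this. To be clear, this is a hedged sketch and not the real `cleanup_tags_service.rb`: `cache` (`#get`/`#set`) and `registry` (`#created_at`) are assumed duck types, and the key format anticipates the one proposed in the implementation section:

```ruby
require 'time'

# Hedged sketch of steps 1-7; `cache` and `registry` are assumed
# duck types, not real GitLab classes.
def filter_tags_to_destroy(image_id, tag_names, cache:, registry:, older_than:)
  cutoff = Time.now - older_than                     # step 2: policy parameter

  tags = tag_names.map do |name|                     # step 1: result of F1
    key = "container_repository:{#{image_id}}:cleanup_tag:#{name}"
    cached = cache.get(key)                          # step 4: cache lookup
    created_at = cached && Time.parse(cached)
    # step 3: ignore entries older than older_than (a redis TTL would
    # normally have evicted them already)
    created_at = nil if created_at && created_at < cutoff
    created_at ||= registry.created_at(image_id, name) # step 5: registry call
    { name: name, key: key, created_at: created_at }
  end

  to_destroy, to_keep = tags.partition { |t| t[:created_at] < cutoff } # step 6: F2

  to_keep.each do |t|                                # step 7: cache only kept tags
    cache.set(t[:key], t[:created_at].iso8601, ttl: older_than)
  end

  to_destroy
end
```

On the second run over the same tags, only the tags that were candidates for destruction trigger a registry call; the kept tags are served from the cache.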
I will emphasize here that cache entries are only created for tags that are kept, i.e. filtered out by F2. Why? Because there is no sense in creating a cache entry for a tag that is about to be destroyed.
A few things to note:
- Cache entries have a TTL. I chose to use the cleanup policy `older_than` parameter, but we could choose something shorter, such as `older_than / 2`.
- The job itself will not evict cache entries.
- Optional: the job could log the cache hit ratio.
Why does this custom cache logic work with the mutable nature of tags? Because we snapshot the `created_at` field only for tags that we want to keep, and when it's time to destroy a tag, it's not the tag's cache entry we rely on: the entry has been removed, which "forces" the job to ping the container registry for the latest `created_at`. In other words, the job destroys tags using information that comes only from the container registry.
By avoiding the stale data, the job supports any tag mutation. Here is an example (assuming `older_than` = 90 days):

- `T`: the tag `t` is created.
- `T + 1.day`: the job runs and caches `T` for tag `t`.
- `T + 10.days`: the tag is destroyed.
- `T + 20.days`: the tag is recreated (same name).
- `T + 91.days`:
  - the job runs and removes the cache entry for `t` (`T` is older than `older_than`),
  - the job pings the container registry for `t` and receives `T + 20.days`,
  - the job filters out `t` (`T + 20.days` is within `older_than`),
  - the job creates a new cache entry with `T + 20.days` for tag `t`.
The tag mutation didn't lead the job to a wrong decision. The cache was properly updated with the new `created_at`.
💡 Technical details ideas
The cache itself can be implemented in two ways:

- redis
- a regular database table

It has to support the following operations:

1. Create an entry with: container image id, tag name, `created_at`.
2. Get the entry for a given container image id and tag name.
3. Given a container image id, remove all the cache entries where `created_at` is `< Time.now - policy.older_than`.
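To pin the contract down before picking a backend, here is a minimal in-memory model of those three operations. This is a sketch under assumed method names (`put`/`fetch`/`evict_older_than`), not a proposal for the final API:

```ruby
# In-memory model of the three cache operations; names are assumptions.
class TagCreatedAtCache
  def initialize
    @entries = {} # [container_image_id, tag_name] => created_at (Time)
  end

  # Operation (1.): create an entry.
  def put(image_id, tag_name, created_at)
    @entries[[image_id, tag_name]] = created_at
  end

  # Operation (2.): read the entry for one image id + tag name.
  def fetch(image_id, tag_name)
    @entries[[image_id, tag_name]]
  end

  # Operation (3.): drop every entry of one image whose created_at is
  # older than the policy window. With redis this becomes a key TTL.
  def evict_older_than(image_id, older_than)
    cutoff = Time.now - older_than
    @entries.delete_if { |(id, _name), created_at| id == image_id && created_at < cutoff }
  end
end
```

A database table would implement operation (3.) as a `DELETE ... WHERE` pass that the job (or a separate job) has to run; redis can do the equivalent for free, as discussed next.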
Here are some numbers from gitlab.com:

- ~50K container images are processed by cleanup policies
  - keep in mind that we're rolling out cleanup policies to all projects, so this number will only increase
- before applying F1, we truncate the list of tags to a maximum of 200
- so, the worst-case scenario is handling 50K * 200 = ~10M cache entries
Redis feels better geared to this problem than a database. Operation (3.) can be implemented with a TTL on the redis key directly: the job doesn't have to evict cache entries, redis will do that for us automatically.

The amount of data to be cached can be a concern. As such, I would suggest using a feature flag to "mark" the projects (or namespaces) that can use this cache feature.
⚙ Implementation
Following #339129 (comment 660056458), redis is a better candidate for what we want to do.
- Step (3.): there is no need for this step with redis. Using `SET`, we can set an expiration time. In other words, redis itself will handle this part for us.
- Step (4.): to get the value from the cache, we can use `GET`. Nothing special here.
- Step (7.): to set a value, we can use `SET` as described above.
  - Regarding the TTL, the entry should expire at `created_at + older_than`.
- The key for `SET` and `GET` should be `container_repository:{<id>}:cleanup_tag:<tag name>`. The container image id and the tag name must be part of the key. The value is, well, the `created_at` value.
- Avoid n+1 redis operations since we're looping on a list of tags. For this, use redis pipelines.
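Putting the redis pieces together, the pipelined reads and writes could look like the sketch below. It assumes a redis-rb style client (`GET`, `SET` with the `ex:` expiry option, and the block form of `pipelined`); the helper names are mine, not existing code:

```ruby
require 'time'

# Key format from above; image id and tag name are both part of the key.
def cleanup_cache_key(image_id, tag_name)
  "container_repository:{#{image_id}}:cleanup_tag:#{tag_name}"
end

# Step (4.): fetch all cached created_at values in one round trip.
def read_cached_created_ats(redis, image_id, tag_names)
  values = redis.pipelined do |pipeline|
    tag_names.each { |name| pipeline.get(cleanup_cache_key(image_id, name)) }
  end
  tag_names.zip(values).to_h { |name, value| [name, value && Time.parse(value)] }
end

# Step (7.): write the kept tags in one round trip. The EX option makes
# each entry expire at created_at + older_than, which covers step (3.).
def write_cache_entries(redis, image_id, tags_with_created_at, older_than)
  redis.pipelined do |pipeline|
    tags_with_created_at.each do |name, created_at|
      ttl = (created_at + older_than - Time.now).to_i
      next if ttl <= 0 # already past the window: nothing to cache
      pipeline.set(cleanup_cache_key(image_id, name), created_at.iso8601, ex: ttl)
    end
  end
end
```

With pipelining, a job handling 200 tags issues two network round trips to redis (one batch of `GET`s, one batch of `SET`s) instead of up to 400 sequential commands.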