
Add new cleanups tags service

David Fernandez requested to merge 367729-new-service into master

🐆 Context

Container Registry data lives on two sides: the Rails backend and the Container Registry backend. At its core, it's only two "objects":

  • Repositories. These live on the Rails side (under the ContainerRepository model) and on the Container Registry side.
  • Tags. One repository can host many tags. These only live in the Container Registry.

This means that for anything related to tags, the Rails backend needs to ping the Container Registry (using the provided API). One example relevant to this MR is cleanup policies, which are a way to automatically remove stale or unwanted tags (as these take space on Object Storage). Those policies are executed by background workers that currently pretty much hammer the Container Registry API to gather information on tags. Among other things, policies need the creation timestamp of each tag in order to sort them. The available API returns one timestamp per tag, one request at a time. Now, imagine what happens when the policy has to go through a repository that has 60K or 100K tags? Yeah, 💥 that many network requests to the Container Registry.

We knew that to improve the situation, we needed an API that could return a set of tags with their creation timestamps in a single call (e.g. a paginated API). This evolution was gated on the Container Registry side by a data migration. That migration involves reorganizing object metadata. Among other things, this metadata is now stored in a database, which makes it much easier to query.

The migration has been ongoing since January and repositories (and their tags) have been "moved" to the new code path on the Container Registry. This allowed the Container Registry to come up with a new tags API that is exactly what's needed here: it returns the list of tags of a given repository in a paginated way. Among other things, the creation timestamp is returned. See container-registry#708 (closed) and the related API documentation here: https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/api.md#list-repository-tags.
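
To give an idea of what consuming that endpoint looks like, here is a minimal sketch of a paginated fetch. It is not the client code from !94065; the field names and the Link header based pagination follow the linked API documentation, but treat the exact shapes (and the missing authentication handling) as assumptions.

    # Illustrative only: fetch one page of tags from the new GitLab tags API.
    # Field names and pagination details are assumptions based on the linked
    # API documentation; authentication is omitted for brevity.
    require 'net/http'
    require 'json'
    require 'uri'

    def fetch_tags_page(registry_url, repository_path, page_size: 100, last: nil)
      uri = URI("#{registry_url}/gitlab/v1/repositories/#{repository_path}/tags/list/")
      params = { n: page_size }
      params[:last] = last if last # cursor: name of the last tag from the previous page
      uri.query = URI.encode_www_form(params)

      response = Net::HTTP.get_response(uri)
      tags = JSON.parse(response.body)
      # each entry looks roughly like:
      # { "name" => "v1.0.0", "digest" => "sha256:...", "created_at" => "2022-08-01T10:00:00Z" }

      [tags, response['Link']] # the Link header advertises the next page, if any
    end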

This unlocked the work for &8379 (closed), which is the epic to improve cleanup policies execution by using that new API endpoint.

That epic has the following implementation plan:

  1. Support the new tags API endpoint in the existing container registry client. That's !94065 (merged).
  2. Add the new cleanup service that will use that new tags API endpoint 👈 You are here. That's issue #367729 (closed).
  3. Create a switch with a feature flag so that the proper service is selected according to the repository migration status and the feature flag.
  4. Incrementally deploy the change and monitor the impact on cleanup policies.

In this MR, we're going to add a new cleanup tags service. Externally, it will behave exactly the same as the existing service, but internally things will be different. The goal of the service is simple: it receives a container repository and cleanup rules, and out of these it has to build a list of tags to delete and call the delete service to actually destroy those tags. This goal does not change with this MR.
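
To make that contract concrete, here is a rough skeleton of the new class. The constructor arguments and the shape of the result hash match the console example in the validation steps below; everything inside execute is just a placeholder, not the actual implementation.

    # Skeleton of the new service's external contract. The arguments and the
    # result hash mirror the console example further down; internals are
    # placeholders.
    module Projects
      module ContainerRepository
        module Gitlab
          class CleanupTagsService
            def initialize(container_repository, user, params)
              @container_repository = container_repository
              @user = user
              @params = params # e.g. { 'name_regex' => '.*', 'keep_n' => 1 }
            end

            # Builds the list of tags to delete and calls the delete service.
            # Returns something like:
            # { original_size: 10, before_delete_size: 9, deleted_size: 9,
            #   deleted: [...], status: :success }
            def execute
              # 1. fetch tags page by page with the new tags API
              # 2. apply the cleanup rules to each page
              # 3. hand the selected tags to the delete service
            end
          end
        end
      end
    end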

As I said, the implementation of the service will differ from the existing one:

  1. Use the new tags API
    • That's the core of this improvement: in one single network call, we get a list (page) of tags and their timestamps. That's enough to apply the cleanup rules.
  2. We will not truncate any received list.
  3. We will not cache any of the received timestamps.
  4. We will support an execution timeout as we don't want to run into problems if the considered repository has many, many pages of tags.

One thing that is crucial here is to keep the rules accurate: we don't want a case where a given tag is deleted when we use the old service but not when we use the new one.

The only exception to this is when we're going through multiple pages. The new service will walk through pages of tags. How do we apply the cleanup rules there? We could walk through all the available pages and then apply the rules. That could work, but there is a risk: the execution time is limited (on purpose) and getting all the pages can take time (we have a champion repository on gitlab.com with more than 80 pages). In this MR, we choose to do things differently: instead of trying to apply the rules globally (on the entire list of tags), we apply them locally (on single pages). In short, we loop on pages and, for each page, we apply the cleanup rules.
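
Here is a simplified sketch of that per-page loop. The helper names (each_tags_page, filter_tags_with_rules, delete_tags) and the timeout value are illustrative, not the actual internals of the service.

    # Simplified sketch of applying the cleanup rules page by page, with an
    # execution timeout. Helper names and the timeout value are illustrative.
    def cleanup_pages(deadline: 10.minutes.from_now)
      deleted = []

      each_tags_page do |tags| # one network call per page
        break if Time.current > deadline # stop early rather than walking every page

        # Rules are applied locally, to this page only. A tag that survives this
        # run but should eventually go will be caught by a later policy execution.
        to_delete = filter_tags_with_rules(tags)
        deleted += delete_tags(to_delete)
      end

      deleted
    end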

Why can we do that? Well, cleanup policies have a nice trait: they are executed several times per week. If a tag is not deleted on a given execution, that's fine, we will remove it on the next execution. We already use this approach (local rules execution) given that we have some repositories with a massive list of tags. With time, we will monitor the tags list size encountered in cleanup policies and perhaps at some point, we can switch to: let's load the entire list of tags and clean it up.

Lastly, this new service will be "dead code" for now as it will not be called by any part of the code base. It's only with step (3.) that we will create a "facade" that the current callers will use. That facade will then switch to the correct service depending on the conditions; among other things, we will have feature flag support there.
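
For reference, the step (3.) facade could look roughly like this. The feature flag name and the exact migration condition are assumptions, and nothing like this is added in this MR.

    # Rough sketch of the future step (3.) facade. The feature flag name and
    # the migration condition are assumptions.
    module Projects
      module ContainerRepository
        class CleanupTagsServiceFacade
          def initialize(container_repository, user, params)
            @container_repository = container_repository
            @user = user
            @params = params
          end

          def execute
            service_class.new(@container_repository, @user, @params).execute
          end

          private

          def service_class
            if use_new_service?
              ::Projects::ContainerRepository::Gitlab::CleanupTagsService
            else
              ::Projects::ContainerRepository::CleanupTagsService
            end
          end

          def use_new_service?
            Feature.enabled?(:new_cleanup_tags_service, @container_repository.project) && # hypothetical flag name
              @container_repository.migration_state == 'import_done'
          end
        end
      end
    end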

🔬 What does this MR do and why?

  • Add app/services/projects/container_repository/gitlab/cleanup_tags_service.rb.
  • Factor out the code shared by the new service and app/services/projects/container_repository/cleanup_tags_service.rb.
  • Update the lib/container_registry/tag.rb class so that an updated_at field is supported (see the sketch after this list).
  • Create/Update the related specs.
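
As a rough idea of the tag.rb change, the addition boils down to an updated_at accessor parsed from the API payload. This sketch is simplified; the real class is built around a repository and a tag name and carries more attributes.

    # Sketch of exposing updated_at on ContainerRegistry::Tag. Simplified: the
    # real class wraps a repository and a tag name and has more attributes.
    require 'date'

    module ContainerRegistry
      class Tag
        def initialize(attributes = {})
          @attributes = attributes
        end

        def updated_at
          timestamp = @attributes['updated_at']
          DateTime.rfc3339(timestamp) if timestamp
        end
      end
    end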

This change has no changelog as it is not connected to anything.

🖥 Screenshots or screen recordings

n / a

How to set up and validate locally

Testing this MR is a bit challenging as we need a Container Registry running with the metadata database support. Still, it's not impossible, so let's get started.

  1. Have a GDK ready with the Container Registry support.
  2. Follow these instructions to set up a Container Registry with the new API support.
  3. Create a new project.
  4. To push repositories with many tags, you can use https://gitlab.com/nmezzopera/container-factory or you can manually push several tags to a given repository.
    • Try to push 10 tags in 2 repositories.
  5. Check in the UI (<project url>/container_registry) that you created the images with the right number of tags.

We are now ready to test this. We can't play with the timestamp-based rules (as all the tags have just been created), but we can still play with the keep_n parameter.

  1. In a Rails console, let's mark the last container repository as "imported" (this is required so that the new tags API is used):
    ContainerRepository.last.update!(migration_state: :import_done)
  2. Create the cleanup service:
    service = ::Projects::ContainerRepository::Gitlab::CleanupTagsService.new(ContainerRepository.last, User.first, { 'name_regex' => '.*', 'keep_n' => 1 })
    This will remove all tags except the most recent one.
  3. Let's execute the cleanup:
    service.execute
    # => {:original_size=>10, :before_delete_size=>9, :deleted_size=>9, :deleted=>["latest-11902", "beta-30943", "latest-9015", "next-29282", "alpha-30954", "beta-6502", "latest-19104", "alpha-4814", "alpha-18195"], :status=>:success}
  4. Check in the UI that there is now only one tag.
  5. Check in the registry logs that /gitlab/v1/repositories/<project_path>/tags/list/?n=1000 has been accessed. That's the new API endpoint.

We can see in the service result that 9 tags have been selected for destruction and destroyed.

We did so with a single network call to build the list of tags to delete (instead of the 11 calls the old approach would need: one to list the tags, plus one per tag to fetch its timestamp).

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

