
Add limited capacity job to destroy container repositories

David Fernandez requested to merge 217702-limited-capacity-job into master

🗄 Context

Users can host container repositories in their Projects using the GitLab Container Registry.

The modeling can be simplified as:

flowchart LR
  p(Project)--1:n--- cr(Container Repository)
  cr--1:n--- t(tag)

Easy, right? Well, we have a few challenges (simplified):

  • ContainerRepository data is hosted on the rails backend and the container registry.
  • Tags, on the other hand, only exist in the container registry.

When we read a container repository on the rails side, we can't know in advance how many tags it has. To know that, we need to call the container registry API to get the list of tags.
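For illustration, here is what that looks like from a rails console. Treat this as a sketch (repo.tags_count is also used later in this description):

# Tag data lives in the registry, so these calls go over HTTP, not to PostgreSQL
repo = ContainerRepository.last
repo.tags_count # triggers a request to the container registry API
repo.tags       # the list of tags, fetched from the registry, not from the database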

Now, let's say that a user clicks the destroy button of a container repository on the rails side. We have a few things to do to complete this operation (simplified; a rough sketch follows the list):

  1. Delete all tags.
    • We need to call one DELETE endpoint per tag here, as the container registry API doesn't have a bulk tag delete endpoint (yet).
  2. Delete the container repository.
    • We have to call one DELETE endpoint in the container registry API.
    • We have to remove the related row from the database.
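To make those two steps concrete, here is a very rough sketch of the flow. The helper names are hypothetical, not the actual implementation:

# Rough sketch of the destroy flow described above (hypothetical helpers)
def destroy_container_repository(repository)
  # (1.) one DELETE request per tag against the container registry API;
  #      each request can fail independently (restarts, network hiccups, ...)
  repository.tags.each { |tag| delete_tag(repository, tag) } # delete_tag is hypothetical

  # (2.) delete the repository on the registry side, then remove the rails row
  delete_repository_from_registry(repository) # hypothetical registry API call
  repository.destroy!
end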

The above is quite involved, so this operation is delayed to a background worker.

The current worker (DeleteContainerRepository) will simply walk through steps (1.) and (2.).

Now, on gitlab.com we have some heavy container repositories (with close to 100 000 tags). Step (1.) will certainly take time. On top of that, (1.) makes many network requests (recall the DELETE request per tag) to the container registry, and those can fail due to restarts, hiccups or other issues. As such, (1.) has a good chance of failing.

The problem is that the current implementation ignores some of those failures and still executes (2.) 😱. This is not great as it leaves some 👻 tags in the container registry (i.e. tags whose container repository no longer exists on the rails side).

Another problem is that the worker could be terminated for being a long-running job, and the delete operations would never be retried. Container repositories would stay marked as "pending destruction" in the UI, as we have a status field on the container repository to indicate whether a repository is being deleted or not.

In short, (1.) is not reliable and causes quite a few issues. This is issue #217702 (closed).

🚑 Limited capacity jobs to the rescue!

The main idea to tackle those problems is to have a job that can be interrupted, killed or stopped; it doesn't matter much, because the delete operation will be resumed.

To implement that, we're going to leverage a limited capacity job. Its responsibility is quite simple:

  1. Take the next pending destruction container repository; exit if there is none.
  2. Loop over the tags and delete them, while limiting the execution time.
  3. If (2.) succeeds, destroy the container repository.
  4. Re-enqueue itself (this is automatically done as part of the limited capacity worker).

Now, (2.) can be stopped or interrupted. That's fine. As long as we keep the container repository as pending destruction, the delete operation will be resumed at a later time.

In other words, this job will loop non-stop until all pending destruction container repositories are processed (i.e. removed).
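To make this concrete, here is a minimal sketch of such a worker. The shape (perform_work, remaining_work_count, max_running_jobs) follows GitLab's LimitedCapacity::Worker concern, but the method bodies and helper names are illustrative only:

module ContainerRegistry
  class DeleteContainerRepositoryWorker # simplified sketch
    include ApplicationWorker
    include LimitedCapacity::Worker

    MAX_CAPACITY = 2 # assumption: the real capacity may differ

    def perform_work
      # (1.) take the next pending destruction container repository, exit if none
      repository = next_delete_scheduled_repository # hypothetical helper
      return unless repository

      repository.delete_ongoing! # so that another job doesn't pick it up
      # (2.) delete tags with a time limit; (3.) destroy the repository only if (2.) finished
      # (resetting the status after an interrupted run is elided here)
      repository.destroy! if delete_tags_with_timeout(repository) # hypothetical helper
    end

    # (4.) the concern re-enqueues the worker while this returns a positive number
    def remaining_work_count
      ContainerRepository.delete_scheduled.limit(max_running_jobs + 1).count
    end

    def max_running_jobs
      MAX_CAPACITY
    end
  end
end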

That's nice and cool but how do we kick start the loop?

This will be done with a cron job.

The beauty of this approach is that any web request deleting a container repository doesn't have to enqueue any worker. Marking the container repository as pending destruction is enough. The two jobs will guarantee that it will be picked up for processing.
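The cron worker itself (coming in the next MR) can then stay tiny; a sketch with a hypothetical class name:

module ContainerRegistry
  class CleanupWorker # hypothetical name; the real one comes in the follow-up MR
    include ApplicationWorker
    # cron schedule configuration elided

    def perform
      # kick off the limited capacity loop; it exits on its own once there are
      # no more pending destruction container repositories
      ContainerRegistry::DeleteContainerRepositoryWorker.perform_with_capacity
    end
  end
end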

MRs split

The entire change was a bit too big for my taste to fit in a single MR. So I split the work into several MRs:

  1. The limited capacity job and database changes. 👈 You're here.
  2. The cron job and the feature flag.
  3. Feature flag cleanup, along with removal of the old approach to destroying container repositories.

🔬 What does this MR do and why?

  • Database changes
    • Add a new column delete_started_at to table container_repositories
  • Model changes
    • Add a new status delete_ongoing to ContainerRepository. This is used to make sure that two limited capacity jobs don't pick up the same container repository.
    • Add helper functions to ContainerRepository to start and reset the delete phase (a rough sketch follows this list).
  • Background jobs
    • Add the ContainerRegistry::DeleteContainerRepositoryWorker job which picks up the next delete_scheduled container repository and starts removing it.
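A rough sketch of the model-side additions. The enum values and helper names are illustrative, not necessarily the ones used in this MR:

class ContainerRepository < ApplicationRecord
  # existing status enum extended with delete_ongoing (integer values illustrative)
  enum status: { delete_scheduled: 0, delete_failed: 1, delete_ongoing: 2 }

  # start the delete phase: flip the status and record when it started
  def start_delete! # hypothetical name
    update!(status: :delete_ongoing, delete_started_at: Time.zone.now)
  end

  # reset the delete phase, e.g. after an interrupted run
  def reset_delete! # hypothetical name
    update!(status: :delete_scheduled, delete_started_at: nil)
  end
end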

📺 Screenshots or screen recordings

n / a

How to set up and validate locally

  1. Have GDK ready with the container registry set up.
  2. Set up your docker client.

Time to create a container repository with a few tags. Create a Dockerfile with the content below, along with an empty seed file in the same directory (e.g. `touch seed`), since the ADD instruction needs it:

`Dockerfile`
FROM scratch
ADD seed /

Now, let's create a container repository with many tags:

for i in {1..100}
do
 docker build -t <registry url>/<project-path>/registry-test:$i .
 docker push <registry url>/<project-path>/registry-test:$i
done

I used 100 as the number of tags.

Everything is ready to play around. In a rails console:

  1. Get the container repository:
    repo = ContainerRepository.last
  2. Check that we have many tags:
    repo.tags_count # should be the amount of tags you created
  3. We need to categorize our repository as non-migrated:
    repo.update!(created_at: ::ContainerRepository::MIGRATION_PHASE_1_ENDED_AT - 1.month)
    • Basically, the container registry of GDK doesn't handle migrated repositories (yet). As such, we need to make sure that rails treats this as a non-migrated repository during the tags cleanup. (Migrated repos with the GitLab container registry use a much more efficient way to delete tags.)
  4. Let's mark the repository as delete_scheduled:
    repo.delete_scheduled!
  5. Now, let's enqueue our limited capacity job:
    ContainerRegistry::DeleteContainerRepositoryWorker.perform_with_capacity
  6. In log/sidekiq.log, you should see these lines:
    {"severity":"INFO","time":"2022-10-25T14:18:50.803Z","retry":0,"queue":"default","backtrace":true,"version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","class":"ContainerRegistry::DeleteContainerRepositoryWorker","args":[],"jid":"d6585226d7c2127474cc418b","created_at":"2022-10-25T14:18:50.771Z","meta.feature_category":"container_registry","correlation_id":"2bb2bfee97d503f932b200d1232b936f","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2022-10-25T14:18:50.800Z","job_size_bytes":2,"pid":86294,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-d6585226d7c2127474cc418b: start","job_status":"start","scheduling_latency_s":0.002467}
    
    
    {"severity":"INFO","time":"2022-10-25T14:18:55.963Z","project_id":303,"container_repository_id":136,"container_repository_path":"root/registry-refacto/test2","tags_size_before_delete":99,"deleted_tags_size":99,"meta.feature_category":"container_registry","correlation_id":"2bb2bfee97d503f932b200d1232b936f","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","class":"ContainerRegistry::DeleteContainerRepositoryWorker","job_status":"running","queue":"default","jid":"d6585226d7c2127474cc418b","retry":0}   
    
    
    {"severity":"INFO","time":"2022-10-25T14:18:55.976Z","retry":0,"queue":"default","backtrace":true,"version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","class":"ContainerRegistry::DeleteContainerRepositoryWorker","args":[],"jid":"d6585226d7c2127474cc418b","created_at":"2022-10-25T14:18:50.771Z","meta.feature_category":"container_registry","correlation_id":"2bb2bfee97d503f932b200d1232b936f","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2022-10-25T14:18:50.800Z","job_size_bytes":2,"pid":86294,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-d6585226d7c2127474cc418b: done: 5.17304 sec","job_status":"done","scheduling_latency_s":0.002467,"redis_calls":9,"redis_duration_s":0.0023899999999999998,"redis_read_bytes":211,"redis_write_bytes":1393,"redis_cache_calls":1,"redis_cache_duration_s":0.000233,"redis_cache_read_bytes":202,"redis_cache_write_bytes":55,"redis_queues_calls":4,"redis_queues_duration_s":0.001411,"redis_queues_read_bytes":5,"redis_queues_write_bytes":799,"redis_shared_state_calls":4,"redis_shared_state_duration_s":0.000746,"redis_shared_state_read_bytes":4,"redis_shared_state_write_bytes":539,"db_count":9,"db_write_count":5,"db_cached_count":1,"db_replica_count":0,"db_primary_count":9,"db_main_count":9,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":1,"db_main_cached_count":1,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.013,"db_main_duration_s":0.013,"db_main_replica_duration_s":0.0,"cpu_s":0.185157,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":5.17304,"completed_at":"2022-10-25T14:18:55.976Z","load_balancing_strategy":"primary","db_duration_s":0.00403}
  7. If we check the UI, the container repository is gone 🎉 (an extra console check follows below).
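As an extra check from the rails console (illustrative):

ContainerRepository.exists?(repo.id) # => false, the row has been removed
# repo.reload would now raise ActiveRecord::RecordNotFound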

🏎 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

💾 Database review

Migration up

$ rails db:migrate
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: migrating ========
main: -- add_column(:container_repositories, :delete_started_at, :datetime_with_timezone, {:null=>true, :default=>nil})
main:    -> 0.0051s
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: migrated (0.0056s) 

main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: migrating =======
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:container_repositories, [:status, :id], {:name=>"index_container_repositories_on_status_and_id", :where=>"status IS NOT NULL", :algorithm=>:concurrently})
main:    -> 0.0117s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0003s
main: -- add_index(:container_repositories, [:status, :id], {:name=>"index_container_repositories_on_status_and_id", :where=>"status IS NOT NULL", :algorithm=>:concurrently})
main:    -> 0.0038s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: migrated (0.0231s) 

Migration down

$ rails db:rollback
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: reverting ========
main: -- remove_column(:container_repositories, :delete_started_at, :datetime_with_timezone, {:null=>true, :default=>nil})
main:    -> 0.0164s
main: == 20221020124018 AddDeleteStartedAtToContainerRepositories: reverted (0.0213s) 

main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: reverting =======
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:container_repositories, [:status, :id], {:name=>"index_container_repositories_on_status_and_id", :algorithm=>:concurrently})
main:    -> 0.0133s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0004s
main: -- remove_index(:container_repositories, {:name=>"index_container_repositories_on_status_and_id", :algorithm=>:concurrently, :column=>[:status, :id]})
main:    -> 0.0110s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20221025105205 AddStatusAndIdIndexToContainerRepositories: reverted (0.0333s) 

🚜 Queries Analysis

We do have single-row updates, but those are the usual container_repository.update_columns calls. I didn't run an analysis on these queries: they are standard UPDATE queries for a single row selected by primary key.
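For reference, a hedged example of the kind of statement such a call produces (column and values illustrative):

container_repository.update_columns(delete_started_at: Time.zone.now)
# => UPDATE "container_repositories"
#    SET "delete_started_at" = $1
#    WHERE "container_repositories"."id" = $2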

