
Add cron job for container registry cleanup

🗄 Context

Users can host container repositories in their Projects using the GitLab Container Registry.

The data model can be simplified as:

flowchart LR
  p(Project)--1:n--- cr(Container Repository)
  cr--1:n--- t(tag)
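
For illustration only, here is a rough sketch of the rails side of that model (association names simplified, not the exact GitLab schema):

# Sketch of the rails models involved (simplified, for illustration).
class Project < ApplicationRecord
  has_many :container_repositories
end

class ContainerRepository < ApplicationRecord
  belongs_to :project
  # Tags are not stored here: they only live in the container registry
  # and are fetched through its API when needed.
end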

Easy, right? Well, we have a few challenges (simplified):

  • ContainerRepository data is hosted on both the rails backend and the container registry.
  • Tags, on the other hand, only exist in the container registry.

When we read a container repository on the rails side, we can't know in advance how many tags it has. To know that, we need to call the container registry API to get the list of tags.
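
As an illustration (assuming the usual ContainerRepository#tags helper that wraps the registry client), something like this in a rails console triggers a registry API call under the hood:

repository = ContainerRepository.last
repository.tags       # calls the container registry API to list the tags
repository.tags.size  # the tag count is only known after that call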

Now, let's say that a user clicks on the destroy button of a container repository on the rails side. We have a few things to do to complete this operation (simplified):

  1. Delete all tags.
    • We need to call one DELETE endpoint per tag here, as the container registry API doesn't have a bulk tag delete endpoint (yet).
  2. Delete the container repository.
    • We have to call one DELETE endpoint in the container registry API.
    • We have to remove the related row from the database.

The above is quite involved, so this operation is delayed to a background worker.

The current worker (DeleteContainerRepository) will simply walk through steps (1.) and (2.).
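
In pseudo-code, that old flow boils down to something like this (helper names are illustrative, not the real API):

# Pseudo-code of the current behavior (simplified; helper names are made up).
def destroy_container_repository(repository)
  repository.tags.each do |tag|
    delete_tag_in_registry(tag)              # (1.) one DELETE request to the registry per tag
  end

  delete_repository_in_registry(repository)  # (2.) one DELETE request for the repository itself
  repository.destroy!                        # (2.) remove the related rails row
end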

Now, on gitlab.com we have some heavy container repositories (with close to 100 000 tags). Step (1.) will certainly take time. On top of that, (1.) makes many network requests (recall that DELETE request per tag) to the container registry, which can fail due to restarts, hiccups or other issues. As such, (1.) has a good chance of failing.

The problem is that the current implementation ignores some of those failures and still executes (2.) 😱. This is not great, as it leaves some 👻 tags in the container registry (eg. tags whose related container repository no longer exists on the rails side).

Another problem is that the worker could be terminated for running too long, and the delete operations would never be retried. Container repositories would stay marked as "pending destruction" in the UI, as we have a status field on the container repository to indicate whether a repository is being deleted or not.

In short, (1.) is not reliable and causes quite a few issues. This is issue #217702 (closed).

🚑 Limited capacity jobs to the rescue!

The main idea to tackle those problems is to have a job that can be interrupted, killed, stopped, whatever. It doesn't matter much: the delete operation will be resumed.

To implement that, we're going to leverage a limited capacity job. Its responsibility will be quite simple:

  1. Take the next pending destruction container repository; exit if there is none.
  2. Loop on tags and delete them while limiting the execution time.
  3. If (2.) succeeds, destroy the container repository.
  4. Re-enqueue itself (this is automatically done as part of the limited capacity worker).

Now, (2.) can be stopped or interrupted. That's fine. As long as we keep the container repository as pending destruction, the delete operation will be resumed at a later time.

In other words, this job will loop non-stop until all pending destruction container repositories are processed (that is, removed).
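
As a rough sketch (the real worker was introduced in !101946; this assumes the usual LimitedCapacity::Worker contract of perform_work / remaining_work_count / max_running_jobs), it looks like:

# Sketch only, not the actual implementation from !101946.
module ContainerRegistry
  class DeleteContainerRepositoryWorker
    include ApplicationWorker
    include LimitedCapacity::Worker

    def perform_work
      repository = next_pending_destruction_repository # illustrative helper
      return unless repository

      delete_tags_with_timeout(repository) # illustrative helper: can be interrupted, will be retried later
      repository.destroy!                  # only once all tags are gone
    end

    def remaining_work_count
      ContainerRepository.delete_scheduled.count # > 0 means the worker re-enqueues itself
    end

    def max_running_jobs
      2 # illustrative value
    end
  end
end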

That's nice and cool but how do we kick start the loop?

This will be done with a cron job.
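
Concretely, that means a sidekiq-cron entry along these lines (the exact settings key name is my assumption; the schedule is every 5 minutes):

# config/initializers/1_settings.rb (sketch; key name assumed)
Settings.cron_jobs['container_registry_cleanup_worker'] = {
  'cron'      => '*/5 * * * *', # every 5 minutes
  'job_class' => 'ContainerRegistry::CleanupWorker'
}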

The beauty of this approach is that any web request deleting a container repository doesn't have to enqueue any worker. Marking the container repository as pending destruction is enough. The two jobs will guarantee that it will be picked up for processing.

MRs split

The entire change was a bit too big for my taste for a single MR, so I split the work into several MRs:

  1. The limited capacity job and database changes. This is !101946 (merged).
  2. The cron job and the feature flag support. 👈 You are here.
  3. Feature flag cleanup, along with removal of the old approach of destroying container repositories.

🔍 What does this MR do and why?

  • Add a ContainerRegistry::CleanupWorker.
    • This is a cron job scheduled to run every 5 minutes.
    • This will detect any stale ongoing destruction on container repositories and reset them (so that they are retried).
    • This will detect any delete_scheduled container repository and enqueue ContainerRegistry::DeleteContainerRepositoryWorker if necessary.
    • In addition, it will log counts for the delete_scheduled container repositories and those in stale deletes.
  • Add feature flag support.
  • Update all operations (rails controllers, API endpoints) that destroy a container repository to use the old or new approach depending on the feature flag (see the sketch after this list).
  • Update/Create all the related specs.
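
For the callers mentioned above, the gating boils down to something like this (a simplified sketch; the legacy worker arguments are an approximation):

# Sketch of the old/new switch done in the destroy code paths (simplified).
def schedule_container_repository_destruction(repository, current_user)
  if Feature.enabled?(:container_registry_delete_repository_with_cron_worker)
    # New approach: just mark the repository, the cron job and the limited
    # capacity worker will take care of the rest.
    repository.delete_scheduled!
  else
    # Old approach: enqueue the legacy worker directly (arguments approximated).
    repository.delete_scheduled!
    DeleteContainerRepositoryWorker.perform_async(current_user.id, repository.id)
  end
end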

📺 Screenshots or screen recordings

None

How to set up and validate locally

  1. Have GDK ready with the container registry setup.
  2. Set up your docker client.

Time to create a container repository with a few tags. Create a small seed file (its content doesn't matter) and, next to it, a Dockerfile with:

`Dockerfile`
FROM scratch
ADD seed /

Now, let's create a container repository with many tags:

for i in {1..100}
do
 docker build -t <registry url>/<project-path>/registry-test:$i .
 docker push <registry url>/<project-path>/registry-test:$i
done

I used 100 as the number of tags.

Let's play around with a rails console:

  1. Enable the feature flag:
    Feature.enable(:container_registry_delete_repository_with_cron_worker)
  2. We need to categorize our repository as non-migrated:
    ContainerRepository.last.update!(created_at: ::ContainerRepository::MIGRATION_PHASE_1_ENDED_AT - 1.month)
    • Basically, the container registry of GDK doesn't handle migrated repositories (yet). As such, we need to make sure that rails treats this as a non-migrated repository during the tags cleanup. (Migrated repos with the gitlab container registry use a much more efficient way to delete tags.)
  3. Schedule the delete for our container repository
    ContainerRepository.last.delete_scheduled!

Now, let's wait those 5 minutes (or less) given that the cron job is only executed at :00, :05, :10, :15 and so on.
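
If you don't want to wait for the schedule, you can presumably also run the cron worker by hand from the same rails console:

# Runs the cleanup worker inline instead of waiting for its cron schedule.
ContainerRegistry::CleanupWorker.new.perform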

After waiting enough time, you should see these lines in log/sidekiq.log:

{"severity":"INFO","time":"2022-11-08T08:55:05.159Z","retry":0,"queue":"default","backtrace":true,"version":0,"queue_namespace":"cronjob","args":[],"class":"ContainerRegistry::CleanupWorker","jid":"d9ed45c2016bcb272afb1d23","created_at":"2022-11-08T08:55:05.157Z","meta.caller_id":"Cronjob","correlation_id":"eee945a8c66ea9c19cd415a25c40c1a5","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:default:1c3f6896f5901f854330d52bd75eff7c46411059671aef387767811adc368565","size_limiter":"validated","enqueued_at":"2022-11-08T08:55:05.158Z","job_size_bytes":2,"pid":39085,"message":"ContainerRegistry::CleanupWorker JID-d9ed45c2016bcb272afb1d23: start","job_status":"start","scheduling_latency_s":0.001124}

{"severity":"INFO","time":"2022-11-08T08:55:07.552Z","retry":0,"queue":"default","backtrace":true,"version":0,"queue_namespace":"cronjob","args":[],"class":"ContainerRegistry::CleanupWorker","jid":"d9ed45c2016bcb272afb1d23","created_at":"2022-11-08T08:55:05.157Z","meta.caller_id":"Cronjob","correlation_id":"eee945a8c66ea9c19cd415a25c40c1a5","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:default:1c3f6896f5901f854330d52bd75eff7c46411059671aef387767811adc368565","size_limiter":"validated","enqueued_at":"2022-11-08T08:55:05.158Z","job_size_bytes":2,"pid":39085,"message":"ContainerRegistry::CleanupWorker JID-d9ed45c2016bcb272afb1d23: done: 2.392844 sec","job_status":"done","scheduling_latency_s":0.001124,"redis_calls":10,"redis_duration_s":0.001952,"redis_read_bytes":230,"redis_write_bytes":1811,"redis_cache_calls":1,"redis_cache_duration_s":0.000208,"redis_cache_read_bytes":214,"redis_cache_write_bytes":88,"redis_queues_calls":8,"redis_queues_duration_s":0.001545,"redis_queues_read_bytes":16,"redis_queues_write_bytes":1647,"redis_shared_state_calls":1,"redis_shared_state_duration_s":0.000199,"redis_shared_state_write_bytes":76,"db_count":4,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":4,"db_main_count":4,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.005,"db_main_duration_s":0.005,"db_main_replica_duration_s":0.0,"cpu_s":0.012988,"worker_id":"sidekiq_0","rate_limiting_gates":[],"extra.container_registry_cleanup_worker.delete_scheduled_container_repositories_count":31,"extra.container_registry_cleanup_worker.stale_delete_container_repositories_count":0,"duration_s":2.392844,"completed_at":"2022-11-08T08:55:07.552Z","load_balancing_strategy":"primary","db_duration_s":0.00171}

{"severity":"INFO","time":"2022-11-08T08:55:07.553Z","retry":0,"queue":"default","backtrace":true,"version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","class":"ContainerRegistry::DeleteContainerRepositoryWorker","args":[],"jid":"be61adbfde0866cc05bc446e","created_at":"2022-11-08T08:55:07.547Z","meta.caller_id":"ContainerRegistry::CleanupWorker","correlation_id":"eee945a8c66ea9c19cd415a25c40c1a5","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2022-11-08T08:55:07.550Z","job_size_bytes":2,"pid":39085,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-be61adbfde0866cc05bc446e: start","job_status":"start","scheduling_latency_s":0.002858}

{"severity":"INFO","time":"2022-11-08T08:55:09.128Z","retry":0,"queue":"default","backtrace":true,"version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","class":"ContainerRegistry::DeleteContainerRepositoryWorker","args":[],"jid":"be61adbfde0866cc05bc446e","created_at":"2022-11-08T08:55:07.547Z","meta.caller_id":"ContainerRegistry::CleanupWorker","correlation_id":"eee945a8c66ea9c19cd415a25c40c1a5","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2022-11-08T08:55:07.550Z","job_size_bytes":2,"pid":39085,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-be61adbfde0866cc05bc446e: done: 1.575138 sec","job_status":"done","scheduling_latency_s":0.002858,"redis_calls":9,"redis_duration_s":0.002778,"redis_read_bytes":212,"redis_write_bytes":1371,"redis_cache_calls":1,"redis_cache_duration_s":0.00054,"redis_cache_read_bytes":203,"redis_cache_write_bytes":55,"redis_queues_calls":4,"redis_queues_duration_s":0.001469,"redis_queues_read_bytes":5,"redis_queues_write_bytes":777,"redis_shared_state_calls":4,"redis_shared_state_duration_s":0.000769,"redis_shared_state_read_bytes":4,"redis_shared_state_write_bytes":539,"db_count":9,"db_write_count":5,"db_cached_count":1,"db_replica_count":0,"db_primary_count":9,"db_main_count":9,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":1,"db_main_cached_count":1,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.018,"db_main_duration_s":0.018,"db_main_replica_duration_s":0.0,"cpu_s":0.272879,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":1.575138,"completed_at":"2022-11-08T08:55:09.128Z","load_balancing_strategy":"primary","db_duration_s":0.023361}
  • The cron job is executed quickly but enqueues the delete worker.
  • The delete job is executed as usual.
  • Check on the UI: the container repository is now gone. 🎉

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

💾 Database review

We are introducing a bunch of new queries. In !101946 (merged), we introduced a new index to support them, but it's better to check here that everything is behaving properly.
