
Background migration to reset status on container repositories

David Fernandez requested to merge 217702-background-migration into master

🎛 Context

Users can host container repositories in their Projects using the GitLab Container Registry.

The data model can be simplified as:

flowchart LR
  p(Project)--1:n--- cr(Container Repository)
  cr--1:n--- t(tag)

Easy, right? Well, we have a few challenges (simplified):

  • ContainerRepository data is hosted on the rails backend and in the container registry.
  • Tags, on the other hand, only exist in the container registry.

When we read a container repository on the rails side, we can't know in advance how many tags it has. To know that, we need to call the container registry API to get the list of tags.

Now, let's say that a user clicks on the destroy button of a container repository on the rails side. We have a few things to do to complete this operation (simplified):

  1. Delete all tags.
    • We need to call one DELETE endpoint per tag here, as the container registry API doesn't have a bulk tag-delete endpoint (yet).
  2. Delete the container repository.
    • We have to call one DELETE endpoint in the container registry API.
    • We have to remove the related row from the database.

The above is quite involved, so this operation is deferred to a background worker.

The current worker (DeleteContainerRepository) will simply walk through steps (1.) and (2.).
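
Roughly, in pseudocode (illustrative only, not the actual worker implementation; delete_tags! is the same helper used in the validation steps below, and delete_repository_from_registry is a made-up name for the registry-side call):

# Illustrative pseudocode of the current destroy flow, not the actual worker code.
def destroy_container_repository(repository)
  repository.delete_tags!                       # (1.) one registry DELETE request per tag
  delete_repository_from_registry(repository)   # (2.) hypothetical registry-side DELETE call
  repository.destroy                            # (2.) remove the row on the rails side
end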

Now, on gitlab.com we have some heavy container repositories (with close to 100 000 tags). Step (1.) will certainly take time. On top of that, (1.) makes many network requests (recall the DELETE request per tag) to the container registry, which can fail due to restarts, hiccups or other issues. As such, (1.) has a good chance of failing.

The problem is that the current implementation ignores some of those failures and still executes (2.) 😱. This is not great as it leaves some 👻 tags in the container registry (e.g. tags whose related container repository no longer exists on the rails side).

Another problem is that the worker could be terminated for being a long running job, and the delete operations would never be retried. Container repositories would then stay marked as "pending destruction" in the UI, as we have a status field on the container repository to indicate whether a repository is being deleted or not.

In short, (1.) is not reliable and causes quite a few issues. This is issue #217702 (closed).

See #217702 (closed)

🚛 Data migration

We created new background jobs to more accurately delete container repositories based on the delete_scheduled status. (see !101946 (merged))

Before enabling that job and watching all the delete_scheduled container repositories being removed, we need to take care of a data situation: container repositories can be marked as delete_scheduled (e.g. pending destruction) on the rails side, but users can still push tags to them (directly to the container registry). See #217702 (comment 510937428).

If we delete those container repositories (with all the tags), we could break some user workflows.

To fix this, we discussed the following approach (a small code sketch follows below):

  • for each delete_scheduled repository:
    • check with the container registry if it has tags.
    • if it does, reset the status to nil (e.g. undelete it).

The above situation can happen on gitlab.com and on self-managed instances. As such, we don't have any choice: we need to use a background migration.
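
A minimal sketch of that reset logic (for readability it uses the application model and its delete_scheduled scope; the real migration processes rows in batches with its own isolated model):

# Sketch only: reset the status of pending-destruction repositories that still have tags.
ContainerRepository.delete_scheduled.find_each do |repository|
  next unless repository.tags.any? # asks the container registry for the tag list

  repository.update!(status: nil)  # undelete: back to the default state
end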

🤔 What does this MR do and why?

  • Introduce a background migration to loop on delete_scheduled container repositories and reset the status of those that have at least one tag (a skeleton of the queueing migration is sketched after this list).
  • Add the related specs.
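
For reference, here is a rough skeleton of the queueing (post-deployment) migration, reconstructed from the migration name and the batch/delay numbers quoted in the query analysis below; the exact options are assumptions, not a copy of this MR's code:

# db/post_migrate/20221123133054_queue_reset_status_on_container_repositories.rb (skeleton)
class QueueResetStatusOnContainerRepositories < Gitlab::Database::Migration[2.0]
  MIGRATION = 'ResetStatusOnContainerRepositories'
  DELAY_INTERVAL = 2.minutes # 120 sec delay between batches
  BATCH_SIZE = 50            # worst case scenario: 50 rows updated per batch

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    queue_batched_background_migration(
      MIGRATION,
      :container_repositories,
      :id,
      job_interval: DELAY_INTERVAL,
      batch_size: BATCH_SIZE
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, :container_repositories, :id, [])
  end
end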

📺 Screenshots or screen recordings

None

How to set up and validate locally

1️⃣ Setup

Let's set up some container repositories.

  1. Have GDK ready with the container registry set up.
  2. Set up your docker client.

Time to create a container repository with a few tags. Create a small `seed` file (any file will do) and a Dockerfile with:

`Dockerfile`
FROM scratch
ADD seed /

Now, let's create some container repositories with some tags:

for i in {1..10}
do
  for j in {1..10}
  do
    docker build -t <registry url>/<project-path>/registry-test$i:$j .
    docker push <registry url>/<project-path>/registry-test$i:$j
  done
done

That creates 10 container repositories with 10 tags each.

Let's empty 3 container repositories (without removing the container repositories themselves). In a rails console:

Project.find(<project_id>).container_repositories.sample(3).each(&:delete_tags!)

Make sure you disable the background jobs, otherwise they will eat all delete_scheduled container repositories 😄

Feature.disable(:container_registry_delete_repository_with_cron_worker)

Lastly, let's put all container repositories in the delete_scheduled state:

Project.find(<project_id>).container_repositories.each(&:delete_scheduled!)

⚙️ Running the background migration

Have GDK running and migrate the database:

$ rails db:migrate
main: == 20221123133054 QueueResetStatusOnContainerRepositories: migrating ==========
main: == 20221123133054 QueueResetStatusOnContainerRepositories: migrated (0.0558s) =

The background migration is executed and all the statuses should be reset back to nil, except 3 that will stay in delete_scheduled. In a rails console:

Project.find(<project_id>).container_repositories.map(&:status)
=> [nil, nil, nil, nil, "delete_scheduled", "delete_scheduled", nil, nil, nil, "delete_scheduled"]

Success! 🎉

If we check the UI (http://gdk.test:8000/<project_path>/container_registry), we have this:

Screenshot_2022-11-24_at_22.22.44

(all container repositories have a delete button except 3 on which the button is disabled)

Lastly, if we check the logs, we can see the ones from our migration:

==> log/migrations.log <==
{"severity":"INFO","time":"2022-11-25T14:15:25.965Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":212,"container_repository_path":"root/bkg-migration/registry-test1"}
{"severity":"INFO","time":"2022-11-25T14:15:26.004Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":213,"container_repository_path":"root/bkg-migration/registry-test2"}
{"severity":"INFO","time":"2022-11-25T14:15:26.047Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":215,"container_repository_path":"root/bkg-migration/registry-test4"}
{"severity":"INFO","time":"2022-11-25T14:15:26.078Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":217,"container_repository_path":"root/bkg-migration/registry-test6"}
{"severity":"INFO","time":"2022-11-25T14:15:26.115Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":false,"container_repository_id":214,"container_repository_path":"root/bkg-migration/registry-test3"}
{"severity":"INFO","time":"2022-11-25T14:15:26.153Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":false,"container_repository_id":218,"container_repository_path":"root/bkg-migration/registry-test7"}
{"severity":"INFO","time":"2022-11-25T14:15:26.187Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":219,"container_repository_path":"root/bkg-migration/registry-test8"}
{"severity":"INFO","time":"2022-11-25T14:15:26.226Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":220,"container_repository_path":"root/bkg-migration/registry-test9"}
{"severity":"INFO","time":"2022-11-25T14:15:26.263Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":true,"container_repository_id":221,"container_repository_path":"root/bkg-migration/registry-test10"}
{"severity":"INFO","time":"2022-11-25T14:15:26.300Z","correlation_id":null,"migrator":"ResetStatusOnContainerRepositories","has_tags":false,"container_repository_id":216,"container_repository_path":"root/bkg-migration/registry-test5"}

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

💾 Database review

Database 🤖 report is here: !104858 (comment 1186483338).

Migration up

$ rails db:migrate

main: == 20221123133054 QueueResetStatusOnContainerRepositories: migrating ==========
main: == 20221123133054 QueueResetStatusOnContainerRepositories: migrated (0.0558s) =

Migration down

$ rails db:rollback
main: == 20221123133054 QueueResetStatusOnContainerRepositories: reverting ==========
main: == 20221123133054 QueueResetStatusOnContainerRepositories: reverted (0.0248s) =

📊 Query analysis

Background migration execution time analysis for gitlab.com

  • Current amount of container repositories to process: 631.
  • With a batch size of 50, we thus need 13 batches.
  • For each container repository, we will contact the container registry API.
    • Here is the latency p99 of that container registry endpoint; let's assume it's 20 ms.
  • Then for each batch we will update the status. Taking the worst case scenario here (updating 50 rows), we will have ~1 sec for that update.
  • Each batch will also trigger additional queries (lower/upper bound and loading all the rows), that's ~81 ms for those operations.

So for the total execution time, we have: 12 * 120s (delay between each batch) + 631 * 0.02s (container registry API) + 13 * 1s (update) + 13 * 0.08s (additional queries) = ~1467s = ~24.5 minutes.
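
As a quick sanity check of that arithmetic, in plain Ruby with the numbers above:

delay        = 12 * 120    # 1440 seconds of delay between the 13 batches
registry_api = 631 * 0.02  # 12.62 seconds, one registry call per repository
updates      = 13 * 1.0    # 13.0 seconds, one status update per batch
extra        = 13 * 0.08   # 1.04 seconds of additional queries per batch
total = delay + registry_api + updates + extra # => ~1466.7 seconds
total / 60.0                                   # => ~24.4 minutes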

We can safely assume that the background migration should be executed in less than 30 minutes on gitlab.com.

Edited by David Fernandez
