
Add a loop mode to the Enqueuer worker

David Fernandez requested to merge 10io-enqueuer-inner-loop into master

🥝 Context

We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend. For all the nitty-gritty details, see &7316 (comment 897867569).

At the core of the rails logic lies the Enqueuer worker. Its job is to find the next image repository and start the migration on the Container Registry side.

The current implementation is really simple:

  1. find the next repository available for the migration.
  2. start the migration on it (see the sketch below).
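
For illustration, that single-pass flow looks roughly like this. It is only a sketch: `handle_next_repository`, `next_repository_for_migration` and `start_pre_import` are placeholder names, not necessarily the real helpers.

    # Rough sketch of the current single-pass behaviour (placeholder names).
    def handle_next_repository
      # 1. find the next repository available for the migration
      repository = next_repository_for_migration
      return unless repository

      # 2. start the (pre-)import for it on the Container Registry side
      repository.start_pre_import
    end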

Due to using a "global" exclusive lease key with high concurrency, we're seeing a fair amount of:

Cannot obtain an exclusive lease for container_registry:migration:enqueuer_worker. There must be another instance already in execution.

As such, it seems that we have so many jobs running in parallel that we don't achieve the throughput we want. Currently, we have a max limit of 25 migrations that can be started in parallel. Looking at Grafana, we only reach ~11-15 migrations in parallel.

We tried to update the lease key so that its name is scoped to the repository that has been picked up. This attempt failed in unexpected ways (race condition).
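
To make the contention concrete, the whole run is guarded by that single lease key, along these lines. This is a simplified illustration of the `Gitlab::ExclusiveLeaseHelpers` pattern with an assumed TTL, not the actual worker source.

    class ContainerRegistry::Migration::EnqueuerWorker
      include ApplicationWorker
      include Gitlab::ExclusiveLeaseHelpers

      def perform
        # A single global key: only one Enqueuer job can do work at a time.
        # A job that loses the race fails with the "Cannot obtain an exclusive
        # lease" error quoted above.
        in_lock('container_registry:migration:enqueuer_worker', ttl: 30.minutes, retries: 0) do
          handle_next_repository
        end
      end
    end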

We're now changing the approach. What if:

  • we keep the global exclusive lease that we know works with the high concurrency that we have.
  • we modify the Enqueuer job so that, instead of picking one single repo, it simply loops on them (sketched below) until:
    • the max capacity is reached or
    • a deadline is reached or
    • the related feature flag is disabled.
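
A hedged sketch of that loop is below. Only the `container_registry_migration_phase2_enqueuer_loop` flag and `ContainerRegistry::Migration.capacity` come from this MR's validation steps; `handle_repositories_in_loop`, `max_capacity_reached?`, `handle_next_repository` and the 5-minute deadline are illustrative placeholders.

    # Sketch of the loop mode: keep enqueuing repositories until one of the
    # three exit conditions is hit.
    def handle_repositories_in_loop
      deadline = 5.minutes.from_now # illustrative deadline

      loop do
        break if max_capacity_reached?   # max capacity reached
        break if Time.current > deadline # deadline reached
        break unless Feature.enabled?(:container_registry_migration_phase2_enqueuer_loop) # flag disabled

        break unless handle_next_repository # nothing left to pick up
      end
    end

    # Illustrative capacity check against the configured maximum.
    def max_capacity_reached?
      ContainerRepository.where(migration_state: 'pre_importing').count >= ContainerRegistry::Migration.capacity
    end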

That's issue #362117 (closed).

🔬 What does this MR do and why?

  • Update the ContainerRegistry::Migration::EnqueuerWorker to have a "loop" mode.
  • Gate this mode behind a feature flag so that we can switch between the "loop" and "non loop" (current) modes (sketched below).
    • We have a common rollout issue for all the feature flags around the registry migration: #350543.
  • Update the related specs.
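
The gating itself could be as simple as the sketch below (still inside the global exclusive lease). Only `Feature.enabled?` and the flag name are taken from the validation steps further down; the two handler names are placeholders.

    def perform
      if Feature.enabled?(:container_registry_migration_phase2_enqueuer_loop)
        handle_repositories_in_loop # new "loop" mode
      else
        handle_next_repository # current single-repository mode
      end
    end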

We don't have a changelog here as the registry migration is gated behind multiple feature flags. The global one is only enabled on gitlab.com.

📺 Screenshots or screen recordings

n/a

How to set up and validate locally

All the cases and situations that the Enqueuer can face are quite hard to reproduce locally. So for this validation, I will focus on having 2 aborted migrations and 4 next migrations with a capacity of 4. The 2 aborted migrations + 2 of the next migrations should be handled in one execution.

  1. Have a GDK ready with the registry setup.
  2. In a rails console, enable the registry migration feature flags and the loop mode:
    Gitlab::CurrentSettings.update!(container_registry_import_max_tags_count: 0)
    Feature.enable(:container_registry_migration_phase2_enabled)
    Feature.enable(:container_registry_migration_phase2_enqueue_speed_fast)
    Feature.enable(:container_registry_migration_phase2_enqueuer_loop)
  3. Insert 2 aborted migrations:
    FactoryBot.create(:container_repository, :import_aborted, project: Project.first, created_at: 3.years.ago)
    FactoryBot.create(:container_repository, :import_aborted, project: Project.first, created_at: 3.years.ago)
  4. Insert 4 next migrations:
    FactoryBot.create(:container_repository, project: Project.first, created_at: 3.years.ago)
    FactoryBot.create(:container_repository, project: Project.first, created_at: 3.years.ago)
    FactoryBot.create(:container_repository, project: Project.first, created_at: 3.years.ago)
    FactoryBot.create(:container_repository, project: Project.first, created_at: 3.years.ago)

Now, we're going to fake interactions with the container registry, because setting up a container registry for the migration is quite involved.

  1. Update ContainerRepository#external_import_status with:
    # Pretend the registry reports the pre import of this repository as failed
    def external_import_status
      'pre_import_failed'
    end
  2. Update ContainerRepository#migration_pre_import with:
    # Pretend the pre import request to the registry succeeds
    def migration_pre_import
      :ok
    end
  3. Update ContainerRegistry::Migration.capacity with:
    # Force the migration capacity to 4 for this validation
    def self.capacity
      4
    end

We're all set now. Let's see what happens when an Enqueuer job is executed (reload or restart the rails console first):

  1. Check the counts on migration states:
    ContainerRepository.group(:migration_state).count # {"default"=>4, "import_aborted"=>2}
  2. Let's execute the Enqueuer:
    ContainerRegistry::Migration::EnqueuerWorker.perform_in(5.seconds)
  3. Check the counts again:
    ContainerRepository.group(:migration_state).count # {"default"=>2, "pre_importing"=>4}

In the sidekiq logs, you should see lines like these:

{"severity":"INFO","import_type":"retry","container_repository_id":78,"container_repository_path":"gitlab-org/gitlab-test/test_image_1","container_repository_migration_state":"pre_importing","class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"running"}
{"severity":"INFO","import_type":"retry","container_repository_id":79,"container_repository_path":"gitlab-org/gitlab-test/test_image_2","container_repository_migration_state":"pre_importing","class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"running"}
{"severity":"INFO","import_type":"next","container_repository_id":80,"container_repository_path":"gitlab-org/gitlab-test/test_image_3","container_repository_migration_state":"pre_importing","class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"running"}
{"severity":"INFO","import_type":"next","container_repository_id":81,"container_repository_path":"gitlab-org/gitlab-test/test_image_4","container_repository_migration_state":"pre_importing","class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"running"}

We can see 2 retries and 2 next imports, which gives us 4 migrations, matching the max capacity we set. So out of the 6 image repositories waiting for the migration, 2 are still waiting.

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

🙏 Thanks

Credit where it's due: the loop mode idea was first floated by @reprazent in !86669 (comment 938016570).
