
Add registry migration Guard dynamic pre import timeout

🛡 Context

We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend. For all the nitty-gritty details, see &7316 (comment 897867569).

The migration is done through several states. There is a defined orchestration between rails and the Container Registry to move a migration through those states.

At the center of the rails side are two workers: the Enqueuer (the one that starts migrations) and the Guard. We're going to focus on the Guard worker.

Its main responsibilities are:

  1. detecting stale migrations.
  2. verifying those stale migrations with the Container Registry.
    • The Container Registry holds the source of truth on the migration state.
  3. within those stale migrations, detecting "long" running migrations.

In this MR, we're going to zoom in on (3.). The goal is to detect that a migration has been in specific states for too long and to actively cancel it once some thresholds have been hit. One of those states is the pre import state.

Initially, the migration plan was to migrate image repositories on Free plan projects with a low tags count (< 100). That plan has been changed to support much larger tags counts.

The problem we're facing is that we have quite a mix of tags counts. The pre import execution time on the Container Registry depends directly on the tags count: the more tags an image has, the longer the pre import step takes. Because of this mix, we have a hard time picking a proper pre import execution time threshold. What we want from this threshold is to detect genuinely stuck migrations as soon as possible and cancel them on the Container Registry (they will be retried a few times). What we hit is the following:

  • if the timeout is too low, image repositories with a large tags count will get canceled because they will hit the timeout. Canceling those migrations is not great because:
    • It's a false positive. The pre import is not stuck. It is simply taking a lot of time.
    • That same migration will be retried a few times and get canceled each time. Because the pre import step is quite long (30min+), we're wasting Container Registry resources on a migration that will always get canceled.
  • if the timeout is too high, a genuinely stuck migration can hold Container Registry resources for a long time.

Overall, both cases lead to a lower throughput of the whole architecture.
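
To make the trade-off concrete, here is a small sketch that estimates the pre import duration for a mix of tags counts and compares it with two fixed timeouts. The tags counts, both timeout values and the helper name are hypothetical. The 1 tag/s rate is the default quoted in the validation steps of this description, and the estimate assumes that duration = tags count / rate.

```ruby
# Illustrative only: tags counts and timeouts below are made-up values.

# Estimated pre import duration in seconds, assuming a rate in tags per second.
def estimated_pre_import_seconds(tags_count, rate = 1.0)
  tags_count / rate
end

tags_counts = [50, 500, 5_000]
fixed_timeouts = { low: 10 * 60, high: 2 * 60 * 60 } # seconds

fixed_timeouts.each do |label, timeout|
  tags_counts.each do |count|
    estimate = estimated_pre_import_seconds(count)
    verdict = estimate > timeout ? 'canceled although healthy' : 'completes in time'
    puts "#{label} timeout (#{timeout}s), #{count} tags (~#{estimate.round}s): #{verdict}"
  end
end
```

With the low timeout, the 5,000 tags repository is canceled although it is healthy. With the high timeout nothing is canceled by mistake, but a stuck 50 tags migration would hold resources for two hours before being detected.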

With this MR, we want to introduce a dynamic pre import timeout. That's issue #363048 (closed). Basically,

  • We have a fixed timeout for the pre import.
  • When we hit it, we take an additional step: we ask the Container Registry for the tags count.
    • That's an additional network request to the Container Registry.
  • With the real tags count, we compute how much time the pre import should take. For that, we introduce an additional container registry migration application setting: the pre import tags rate, in tag/s.
  • With that computed duration, we apply the same logic: if the pre import is within bounds, it's all good. If it is out of bounds, rails will cancel the migration.
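
The flow above can be sketched in plain Ruby as follows. This is not the Guard worker code: the method names and keyword arguments are hypothetical, the block stands in for the extra Container Registry request, and the expected duration is assumed to be tags count / rate, which is what a rate expressed in tag/s suggests.

```ruby
# Hypothetical helper names; only the overall flow comes from the description.

# Time the pre import is expected to take, given a rate in tags per second.
def dynamic_pre_import_timeout(tags_count, tags_rate)
  tags_count / tags_rate.to_f
end

# elapsed:       seconds since the pre import started
# fixed_timeout: the fixed pre import timeout, in seconds
# tags_rate:     the pre import tags rate, in tags per second
# The block returns the tags count and is only called once the fixed timeout
# has been exceeded, so the extra network request is not made before that.
def cancel_pre_import?(elapsed:, fixed_timeout:, tags_rate:)
  return false if elapsed <= fixed_timeout

  tags_count = yield
  elapsed > dynamic_pre_import_timeout(tags_count, tags_rate)
end

# 45 minutes elapsed, 30 minutes fixed timeout, 1 tag/s, 5,000 tags reported:
# the dynamic timeout is 5,000 seconds, so the migration is left running.
puts cancel_pre_import?(elapsed: 45 * 60, fixed_timeout: 30 * 60, tags_rate: 1) { 5_000 }

# Same timings with 100 tags: the dynamic timeout is 100 seconds, so we cancel.
puts cancel_pre_import?(elapsed: 45 * 60, fixed_timeout: 30 * 60, tags_rate: 1) { 100 }
```

Passing the tags count lookup as a block keeps the cheap fixed check first and defers the request to the Container Registry to the cases that need it.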

🔬 What does this MR do and why?

  • Add a new container registry migration application setting: container_registry_pre_import_tags_rate
  • Update the Guard worker for the pre import long running check:
    • when the fixed timeout is hit, compute the dynamic timeout and check the runtime against that.
  • Update the related specs.
  • Gate this change behind a feature flag: registry_migration_guard_dynamic_pre_import_timeout.
    • Given past fixes, we need a way to "go back" to the old behavior if this change causes an issue on the ongoing migration.

The registry migration is currently ongoing for gitlab.com only. As such, it is gated behind several feature flags.

🖥 Screenshots or screen recordings

n/a

How to set up and validate locally

Testing the whole chain (Rails <-> Container Registry) is quite involved. That said, we can replace a few functions to "stub" the Container Registry responses.

  1. Have GDK ready with registry support.
  2. Enable the pre import dynamic timeout:
    Feature.enable(:registry_migration_guard_dynamic_pre_import_timeout)
  3. Create an image repository stalled in pre_import.
    repo = FactoryBot.create(:container_repository, :pre_importing, project: Project.first, migration_pre_import_started_at: 5.minutes.ago)
  4. Update the Guard worker so that it also runs on non .com environments:
    def perform
      # return unless Gitlab.com? <- comment this line
      # ...
    end
  5. Update the ContainerRepository#external_import_status function to:
    def external_import_status
      'pre_import_in_progress'
    end
  6. Update the ContainerRepository#migration_cancel function to:
    def migration_cancel
      { status: :ok }
    end
  7. Finally, update ContainerRepository#tags_count. The default pre import tags rate is 1 tag/s. Given that we started the pre import 5 minutes ago, let's put a number that will trigger a cancel:
    def tags_count
      10
    end
  8. Let's execute the Guard job (you might need to run reload! to take the above code changes into account):
    ContainerRegistry::Migration::GuardWorker.new.perform
  9. Let's check the repo:
    repo.reload.migration_state # import_aborted

The repo has a (simulated) tags count of 10 and the pre import started 5 minutes ago. That's way too long for the expected rate (1 tag/s). The migration was aborted.

Nice, now let's try the dynamic timeout with a very large repo:

  1. Update ContainerRepository#tags_count to:
    def tags_count
      2000
    end
  2. Create an image repository stalled in pre importing:
    repo = FactoryBot.create(:container_repository, :pre_importing, project: Project.first, migration_pre_import_started_at: 5.minutes.ago)
  3. Let's run the Guard worker (you might need to run reload!):
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Let's check the repo:
    repo.reload.migration_state # pre_importing

The repo has a (simulated) tags count of 2000 and the pre import started 5 minutes ago. With the expected rate (1 tag/s), the pre import timeout is 2000 seconds (about 33 minutes). We're still within the limit. The migration was not canceled and was left running.
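
As a quick sanity check of both scenarios, the arithmetic can be reproduced in a console. This sketch only compares the elapsed time with the dynamic timeout and assumes timeout = tags count / rate. The elapsed time is the 5.minutes.ago value passed to the factory calls, and the variable names are illustrative.

```ruby
# Values taken from the validation steps: 1 tag/s and a pre import started
# 5 minutes ago (the migration_pre_import_started_at given to the factory).
rate = 1.0        # tags per second
elapsed = 5 * 60  # seconds

[10, 2000].each do |tags_count|
  timeout = tags_count / rate
  outcome = elapsed > timeout ? 'cancel' : 'keep running'
  puts "#{tags_count} tags: timeout #{timeout.round}s -> #{outcome}"
end
```

The 10 tags repository is past its 10 seconds timeout and gets canceled, while the 2000 tags repository is well within its 2000 seconds timeout.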

The dynamic timeout is working as expected 🎉

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

💾 Database review

Migration up

$ rails db:migrate 
== 20220520144821 AddRegistryMigrationPreImportTagsRateToApplicationSettings: migrating 
-- add_column(:application_settings, :container_registry_pre_import_tags_rate, :decimal, {:precision=>6, :scale=>2, :default=>0.5, :null=>false})
   -> 0.0070s
== 20220520144821 AddRegistryMigrationPreImportTagsRateToApplicationSettings: migrated (0.0071s) 

== 20220524191259 AddApplicationSettingsContainerRegistryPreImportTagsRateConstraint: migrating 
-- transaction_open?()
   -> 0.0000s
-- current_schema()
   -> 0.0004s
-- transaction_open?()
   -> 0.0000s
-- execute("ALTER TABLE application_settings\nADD CONSTRAINT app_settings_container_registry_pre_import_tags_rate_positive\nCHECK ( container_registry_pre_import_tags_rate >= 0 )\nNOT VALID;\n")
   -> 0.0023s
-- current_schema()
   -> 0.0003s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- execute("ALTER TABLE application_settings VALIDATE CONSTRAINT app_settings_container_registry_pre_import_tags_rate_positive;")
   -> 0.0013s
-- execute("RESET statement_timeout")
   -> 0.0006s
== 20220524191259 AddApplicationSettingsContainerRegistryPreImportTagsRateConstraint: migrated (0.0209s)  

Migration down

$ rails db:rollback
== 20220524191259 AddApplicationSettingsContainerRegistryPreImportTagsRateConstraint: reverting 
-- transaction_open?()
   -> 0.0000s
-- transaction_open?()
   -> 0.0000s
-- execute("ALTER TABLE application_settings\nDROP CONSTRAINT IF EXISTS app_settings_container_registry_pre_import_tags_rate_positive\n")
   -> 0.0020s
== 20220524191259 AddApplicationSettingsContainerRegistryPreImportTagsRateConstraint: reverted (0.0120s) 

== 20220520144821 AddRegistryMigrationPreImportTagsRateToApplicationSettings: reverting 
-- remove_column(:application_settings, :container_registry_pre_import_tags_rate, :decimal, {:precision=>6, :scale=>2, :default=>0.5, :null=>false})
   -> 0.0041s
== 20220520144821 AddRegistryMigrationPreImportTagsRateToApplicationSettings: reverted (0.0070s)
Edited by David Fernandez
