Registry phase 2: guard worker

David Fernandez requested to merge 349745-guard-job into master

🍒 Context

We're currently migrating container images in the Container Registry. To keep things simple (this is not entirely accurate): the Container Registry team is adding support for a database, and that database will host information about the container images. Newly created container images are already inserted in the database, but existing ones need to be migrated.

In &7316 (closed), we described how rails will "drive" this migration. Put simply, using an API on the Container Registry, rails will start the migration for container images and will receive notifications about its progress.

The migration is basically two steps: pre_import and import. After each step, the Container Registry notifies the rails backend. Example: Hey, for container image "my/awesome/image", step "pre_import" has completed.

To track the migration's progress on the rails side, we added a migration_state column handled by a state machine. This allows us to easily trigger actions when a certain state is reached. Here is a simplified view of the states that matter for this MR:

stateDiagram-v2
    [*] --> default
    default --> pre_importing
    pre_importing --> pre_import_done
    pre_import_done --> importing
    importing --> import_done
    pre_importing --> aborted
    pre_import_done --> aborted
    importing --> aborted
    import_done --> [*]
    aborted --> [*]
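
For illustration, here is a rough sketch of how such a state machine could be declared on the model. This assumes the state_machines-activerecord gem and is not the actual ContainerRepository code; the event names are hypothetical, but the states match the diagram (the aborted state is stored as import_aborted):

```ruby
# Hypothetical sketch only, loosely mirroring the diagram above.
class ContainerRepository < ApplicationRecord
  state_machine :migration_state, initial: :default do
    state :pre_importing
    state :pre_import_done
    state :importing
    state :import_done
    state :import_aborted # shown as "aborted" in the diagram

    event :start_pre_import do
      transition default: :pre_importing
    end

    event :finish_pre_import do
      transition pre_importing: :pre_import_done
    end

    event :start_import do
      transition pre_import_done: :importing
    end

    event :finish_import do
      transition importing: :import_done
    end

    event :abort_import do
      transition %i[pre_importing pre_import_done importing] => :import_aborted
    end
  end
end
```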

Now, container registry notifications to rails are not reliable: they can be dropped. In addition, either backend can be stopped or terminated for whatever reason, in which case we lose the migration state sync between the container registry and rails.

To counter this, this MR introduces a "guard" background job (yep, naming is hard). It will scan the pre_importing, pre_import_done and importing states, detect stale migrations, and abort them. That's issue #349745 (closed).

Aborted migrations are retried automatically by the other worker that handles the migration: the Enqueuer. This is how we reconcile an unsynced migration state.

The core responsibility of the guard job is to catch migrations that have been in the importing state for too long. That's because the import step needs to be as short as possible: during this step the container image is in read-only mode, meaning no new tags can be pushed to it, which is a degraded UX. 😿

🔭 What does this MR do and why?

  • Update the registry's gitlab_api_client to support the #migration_status method
    • That's what will get the migration status from the Container Registry API
  • Update the ContainerRepository model with:
    • scopes for selecting stale migrations
    • a global scope to get all stale migrations in a single call (see the sketch after this list)
  • Add a ContainerRegistry::Migration::GuardWorker
    • This will be a cron worker running every 10 minutes (10.minutes); see the worker sketch after this list.
    • It is deduplicated, of course.
    • Will read ContainerRegistry::Migration.max_step_duration.seconds
      • This is the setting that defines when a migration is considered stale. For example, we could say that a migration can't spend more than 5.minutes in a given state.
    • For added safety, the number of stale migrations pulled is limited. We have ContainerRegistry::Migration.capacity, which defines the maximum number of ongoing migrations. This is controlled by feature flags. The absolute maximum is 25 parallel migrations.
    • The whole registry migration is currently scoped to gitlab.com only. As such, on self-managed instances, this job will be a no-op.
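
Here is a rough sketch of what the stale-migration scopes could look like on ContainerRepository. Scope names, column conditions and the use of .or are assumptions based on this description, not the exact implementation:

```ruby
# Hypothetical sketch of the scopes described above.
class ContainerRepository < ApplicationRecord
  scope :with_stale_migration, ->(before_timestamp) do
    stale_pre_importing = where(migration_state: 'pre_importing')
      .where('migration_pre_import_started_at < ?', before_timestamp)
    stale_pre_import_done = where(migration_state: 'pre_import_done')
      .where('migration_pre_import_done_at < ?', before_timestamp)
    stale_importing = where(migration_state: 'importing')
      .where('migration_import_started_at < ?', before_timestamp)

    # The "global" scope: all stale migrations, whatever the stale step, in a single call.
    stale_pre_importing.or(stale_pre_import_done).or(stale_importing)
  end
end
```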
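
Building on that scope, a minimal sketch of the worker itself. The includes, the batching and the abort_import call are assumptions; the actual worker also uses the new #migration_status API call described above, which this sketch leaves out for brevity:

```ruby
# Hypothetical sketch of ContainerRegistry::Migration::GuardWorker.
module ContainerRegistry
  module Migration
    class GuardWorker
      include ApplicationWorker
      include CronjobQueue # cron worker, deduplicated

      def perform
        # The whole migration is scoped to gitlab.com: no-op on self-managed.
        return unless Gitlab.com?

        step_before_timestamp = ::ContainerRegistry::Migration.max_step_duration.seconds.ago

        # Never pull more than the configured migration capacity (absolute max: 25).
        ::ContainerRepository.with_stale_migration(step_before_timestamp)
                             .limit(::ContainerRegistry::Migration.capacity)
                             .each do |container_repository|
          # Aborted migrations are retried later by the Enqueuer worker.
          container_repository.abort_import
        end
      end
    end
  end
end
```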

🖼 Screenshots or screen recordings

See next section

How to set up and validate locally

  1. The worker checks ::Gitlab.com? in its #perform method. Disable that check for local testing (a quick console workaround is sketched after this list).
  2. Enable the max capacity feature flag: Feature.enable(:container_registry_migration_phase2_capacity_25)
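
For step 1, instead of editing the code, one quick (hacky) console workaround is to redefine the check for the duration of the session. This is an assumption for local testing only, not something to ship:

```ruby
# Local testing only: pretend to be GitLab.com so GuardWorker#perform doesn't no-op.
Gitlab.define_singleton_method(:com?) { true }

# Step 2: enable the max capacity feature flag.
Feature.enable(:container_registry_migration_phase2_capacity_25)
```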

1️⃣ Stale importing migrations

In a rails console:

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :importing, project: Project.last)
  2. Let's make it stale
    image.update!(migration_import_started_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "import_aborted"

2️⃣ Stale pre_importing migrations + pre_import_in_progress

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :pre_importing, project: Project.last)
  2. Let's make it stale
    image.update!(migration_pre_import_started_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "import_aborted"

3️⃣ Stale pre_import_done migrations + pre_import_complete

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :pre_import_done, project: Project.last)
  2. Let's make it stale
    image.update!(migration_pre_import_done_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "import_aborted"

🏁 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

💾 Database Review

🤖 Queries

A quick word on the setup I used to test the queries. Right now, we don't have any migration going on. As such, we have no container images in the pre_importing, pre_import_done or importing states.

To solve this, I used an UPDATE statement that sets randomly picked container images to the proper state. The setup instructions are included in each query analysis (a rough sketch of this setup follows below).

Lastly, I created a sample of 200 container images for the query to pick up. This is 8 times the current max number of container images being migrated at any given time (25, see https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/container_registry/migration.rb#L33). I did this on purpose so that if we ever increase the capacity (the absolute max is the number of nodes in the container registry fleet, which is about 50), this job still has good performance.
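
As an illustration, that setup could be reproduced from a rails console along these lines. The exact UPDATE statements I used are attached to each query analysis; the snippet below is only a hedged sketch with hypothetical sampling:

```ruby
# Hypothetical sketch: push 200 random container repositories into one of the
# three target states, with a timestamp old enough to be considered stale.
state_columns = {
  'pre_importing'   => :migration_pre_import_started_at,
  'pre_import_done' => :migration_pre_import_done_at,
  'importing'       => :migration_import_started_at
}

ContainerRepository.order(Arel.sql('RANDOM()')).limit(200).each do |repository|
  state, timestamp_column = state_columns.to_a.sample

  repository.update_columns(migration_state: state, timestamp_column => 1.hour.ago)
end
```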

👆 Migration Up

== 20220202115350 AddMigrationIndexesToContainerRepositories: migrating =======
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_pre_import_started_at], {:name=>"idx_container_repos_on_pre_import_started_at_when_pre_importing", :where=>"migration_state = 'pre_importing'", :algorithm=>:concurrently})
   -> 0.0051s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- add_index(:container_repositories, [:migration_pre_import_started_at], {:name=>"idx_container_repos_on_pre_import_started_at_when_pre_importing", :where=>"migration_state = 'pre_importing'", :algorithm=>:concurrently})
   -> 0.0117s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_pre_import_done_at], {:name=>"idx_container_repos_on_pre_import_done_at_when_pre_import_done", :where=>"migration_state = 'pre_import_done'", :algorithm=>:concurrently})
   -> 0.0033s
-- add_index(:container_repositories, [:migration_pre_import_done_at], {:name=>"idx_container_repos_on_pre_import_done_at_when_pre_import_done", :where=>"migration_state = 'pre_import_done'", :algorithm=>:concurrently})
   -> 0.0037s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_import_started_at], {:name=>"idx_container_repos_on_import_started_at_when_importing", :where=>"migration_state = 'importing'", :algorithm=>:concurrently})
   -> 0.0036s
-- add_index(:container_repositories, [:migration_import_started_at], {:name=>"idx_container_repos_on_import_started_at_when_importing", :where=>"migration_state = 'importing'", :algorithm=>:concurrently})
   -> 0.0030s
== 20220202115350 AddMigrationIndexesToContainerRepositories: migrated (0.0432s) 

👇 Migration Down

== 20220202115350 AddMigrationIndexesToContainerRepositories: reverting =======
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0057s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"idx_container_repos_on_import_started_at_when_importing"})
   -> 0.0041s
-- execute("RESET statement_timeout")
   -> 0.0007s
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0035s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"idx_container_repos_on_pre_import_done_at_when_pre_import_done"})
   -> 0.0019s
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0029s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"idx_container_repos_on_pre_import_started_at_when_pre_importing"})
   -> 0.0016s
== 20220202115350 AddMigrationIndexesToContainerRepositories: reverted (0.0261s) 
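
For completeness, the migration producing the output above boils down to three partial indexes. A minimal sketch, assuming the usual add_concurrent_index / remove_concurrent_index_by_name helpers (the class wrapper is an assumption):

```ruby
# Sketch of the 20220202115350 migration, reconstructed from the output above.
class AddMigrationIndexesToContainerRepositories < Gitlab::Database::Migration[1.0]
  PRE_IMPORTING_INDEX = 'idx_container_repos_on_pre_import_started_at_when_pre_importing'
  PRE_IMPORT_DONE_INDEX = 'idx_container_repos_on_pre_import_done_at_when_pre_import_done'
  IMPORTING_INDEX = 'idx_container_repos_on_import_started_at_when_importing'

  disable_ddl_transaction!

  def up
    add_concurrent_index :container_repositories, :migration_pre_import_started_at,
      name: PRE_IMPORTING_INDEX, where: "migration_state = 'pre_importing'"
    add_concurrent_index :container_repositories, :migration_pre_import_done_at,
      name: PRE_IMPORT_DONE_INDEX, where: "migration_state = 'pre_import_done'"
    add_concurrent_index :container_repositories, :migration_import_started_at,
      name: IMPORTING_INDEX, where: "migration_state = 'importing'"
  end

  def down
    remove_concurrent_index_by_name :container_repositories, IMPORTING_INDEX
    remove_concurrent_index_by_name :container_repositories, PRE_IMPORT_DONE_INDEX
    remove_concurrent_index_by_name :container_repositories, PRE_IMPORTING_INDEX
  end
end
```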