Registry phase 2: guard worker
🍒 Context
We're currently migrating container images in the Container Registry. To keep things simple (this is not 100% accurate): the Container Registry team is adding support for a database, and that database will host information about the container images. Newly created container images are already inserted into the database, but existing ones need to be migrated.
In &7316 (closed), it was described that rails will "drive" this migration. Put simply, using an API on the Container Registry, rails will start the migration for container images and will receive notifications about its progress.
The migration is basically two steps: `pre_import` and `import`. After each step, the Container Registry will notify the rails backend. Example: "Hey, for container image `my/awesome/image`, step `pre_import` has completed."
To follow the migration's evolution on the rails side, we put in place a `migration_state` column that is handled by a state machine. This allows us to easily trigger actions when a certain state is reached. Here is a simplification of the interesting states for this MR:
```mermaid
stateDiagram-v2
    [*] --> default
    default --> pre_importing
    pre_importing --> pre_import_done
    pre_import_done --> importing
    importing --> import_done
    pre_importing --> aborted
    pre_import_done --> aborted
    importing --> aborted
    import_done --> [*]
    aborted --> [*]
```
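The transitions in the diagram above can be summed up in a small table. This is a plain-Ruby sketch for illustration only (the `TRANSITIONS` hash and `valid_transition?` helper are hypothetical, not the MR's actual state machine implementation):

```ruby
# Allowed migration_state transitions, transcribed from the diagram above.
TRANSITIONS = {
  'default'         => %w[pre_importing],
  'pre_importing'   => %w[pre_import_done aborted],
  'pre_import_done' => %w[importing aborted],
  'importing'       => %w[import_done aborted],
  'import_done'     => [],
  'aborted'         => []
}.freeze

# Check whether a transition is allowed by the simplified diagram.
def valid_transition?(from, to)
  TRANSITIONS.fetch(from, []).include?(to)
end

valid_transition?('pre_importing', 'pre_import_done') # => true
valid_transition?('importing', 'pre_importing')       # => false
```

Note that every in-flight state can reach `aborted`, which is exactly the set of states the guard job watches.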
Now, container registry notifications to rails are not reliable: they can be dropped. In addition, we can face situations where either backend is stopped or terminated for whatever reason, and we lose the migration state sync between the container registry and rails.
To counter this, this MR introduces a "guard" background job (yep, naming is hard). It will basically scan through the `pre_importing`, `pre_import_done` and `importing` states, detect stale migrations and abort them. That's issue #349745 (closed).
Aborted migrations are retried automatically by the other worker that handles the migration: the Enqueuer. This is how we reconcile an unsynced migration state.
The core responsibility of the guard job is to look after migrations that have been in the `importing` state for too long. That's because the `import` step needs to be as short as possible: during this step, the container image is put in read-only mode, meaning no new tags can be pushed to the image, which is a degraded UX.
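The staleness rule described above can be sketched in plain Ruby. The `MAX_STEP_DURATION` constant mirrors the MR's `ContainerRegistry::Migration.max_step_duration` setting; the `stale?` helper itself is hypothetical and only illustrates the time comparison:

```ruby
# A migration step is stale when it has been running longer than the
# configured maximum step duration (here: 5 minutes, as in the MR's example).
MAX_STEP_DURATION = 5 * 60 # seconds

# step_started_at: Time at which the current step (pre_import or import) began.
def stale?(step_started_at, now: Time.now)
  now - step_started_at > MAX_STEP_DURATION
end

stale?(Time.now - 3600) # => true  (step started an hour ago)
stale?(Time.now - 60)   # => false (step started a minute ago)
```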
🔭 What does this MR do and why?
- Update the `gitlab_api_client` of the registry to support the `#migration_status` function
  - That's what will get the migration status from the Container Registry API
- Update the `ContainerRepository` model with:
  - scopes to select stale migrations
  - a global scope to get all stale migrations in a single call
- Add a `ContainerRegistry::Migration::GuardWorker`
  - This will be a cron worker. Frequency: every `10.minutes`.
  - It is deduplicated, of course.
  - It will read `ContainerRegistry::Migration.max_step_duration.seconds`.
    - This is the setting that defines what counts as a stale state. For example, we could say that a migration can't spend more than `5.minutes` in a given state.
  - For added safety, the number of stale migrations pulled is limited. We have `ContainerRegistry::Migration.capacity`, which defines the max number of ongoing migrations. This is modified by feature flags. The absolute maximum is `25` parallel migrations.
  - The whole registry migration is currently scoped to gitlab.com only. As such, on self-managed instances, this job will be a no-op.
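Putting the pieces above together, the guard job's control flow can be sketched as follows. This is an illustrative plain-Ruby simulation, not the actual `GuardWorker` code: the `Migration` struct and `guard` method are hypothetical, and only the `25` capacity cap comes from the MR:

```ruby
# Absolute maximum number of parallel migrations, per the MR's capacity setting.
CAPACITY = 25

# Minimal stand-in for a container repository undergoing migration.
Migration = Struct.new(:path, :state)

# Abort at most CAPACITY stale migrations; the Enqueuer worker retries
# aborted migrations later, reconciling any unsynced state.
def guard(stale_migrations)
  stale_migrations.take(CAPACITY).each do |migration|
    migration.state = 'import_aborted'
  end
end

stale = [Migration.new('my/awesome/image', 'importing')]
guard(stale).first.state # => "import_aborted"
```

The capacity cap is the "added safety" mentioned above: even if many migrations go stale at once, a single run never touches more rows than could have been in flight.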
🖼 Screenshots or screen recordings
See next section
⚙ How to set up and validate locally
- ⚠ The worker checks `::Gitlab.com?` in its `#perform` function. Disable that part for local testing.
- Enable the max capacity: `Feature.enable(:container_registry_migration_phase2_capacity_25)`
1️⃣ Stale `importing` migrations
In a rails console:
- Let's create a container repository in the right migration status:

  ```ruby
  image = FactoryBot.create(:container_repository, :importing, project: Project.last)
  ```

- Let's make it stale:

  ```ruby
  image.update!(migration_import_started_at: 1.hour.ago)
  ```

- Run the guard job:

  ```ruby
  ContainerRegistry::Migration::GuardWorker.new.perform
  ```

- Check the status:

  ```ruby
  image.reload.migration_state # "import_aborted"
  ```
2️⃣ Stale `pre_importing` migrations + `pre_import_in_progress`
- Let's create a container repository in the right migration status:

  ```ruby
  image = FactoryBot.create(:container_repository, :pre_importing, project: Project.last)
  ```

- Let's make it stale:

  ```ruby
  image.update!(migration_pre_import_started_at: 1.hour.ago)
  ```

- Run the guard job:

  ```ruby
  ContainerRegistry::Migration::GuardWorker.new.perform
  ```

- Check the status:

  ```ruby
  image.reload.migration_state # "import_aborted"
  ```
3️⃣ Stale `pre_import_done` migrations + `pre_import_complete`
- Let's create a container repository in the right migration status:

  ```ruby
  image = FactoryBot.create(:container_repository, :pre_import_done, project: Project.last)
  ```

- Let's make it stale:

  ```ruby
  image.update!(migration_pre_import_done_at: 1.hour.ago)
  ```

- Run the guard job:

  ```ruby
  ContainerRegistry::Migration::GuardWorker.new.perform
  ```

- Check the status:

  ```ruby
  image.reload.migration_state # "import_aborted"
  ```
🏁 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
💾 Database Review
🤖 Queries
One word on the setup I used to test the queries. Right now, we don't have any migration going on. As such, we have no container images in the `pre_importing`, `pre_import_done` or `importing` states.
To solve this, I used an `UPDATE` statement that sets container images (randomly) in the proper state. Those setup instructions are on each query analysis.
Lastly, I created a sample of `200` container images for the query to get. This is 8 times the actual max number of container images being migrated at any given time (see https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/container_registry/migration.rb#L33, which is `25`). I did this on purpose so that if we ever increase the capacity (the absolute max is the number of nodes in the container registry fleet, which is about `50`), this job still has good performance.
👆 Migration Up
```
== 20220202115350 AddMigrationIndexesToContainerRepositories: migrating =======
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_pre_import_started_at], {:name=>"idx_container_repos_on_pre_import_started_at_when_pre_importing", :where=>"migration_state = 'pre_importing'", :algorithm=>:concurrently})
   -> 0.0051s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- add_index(:container_repositories, [:migration_pre_import_started_at], {:name=>"idx_container_repos_on_pre_import_started_at_when_pre_importing", :where=>"migration_state = 'pre_importing'", :algorithm=>:concurrently})
   -> 0.0117s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_pre_import_done_at], {:name=>"idx_container_repos_on_pre_import_done_at_when_pre_import_done", :where=>"migration_state = 'pre_import_done'", :algorithm=>:concurrently})
   -> 0.0033s
-- add_index(:container_repositories, [:migration_pre_import_done_at], {:name=>"idx_container_repos_on_pre_import_done_at_when_pre_import_done", :where=>"migration_state = 'pre_import_done'", :algorithm=>:concurrently})
   -> 0.0037s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_import_started_at], {:name=>"idx_container_repos_on_import_started_at_when_importing", :where=>"migration_state = 'importing'", :algorithm=>:concurrently})
   -> 0.0036s
-- add_index(:container_repositories, [:migration_import_started_at], {:name=>"idx_container_repos_on_import_started_at_when_importing", :where=>"migration_state = 'importing'", :algorithm=>:concurrently})
   -> 0.0030s
== 20220202115350 AddMigrationIndexesToContainerRepositories: migrated (0.0432s)
```
👇 Migration Down
```
== 20220202115350 AddMigrationIndexesToContainerRepositories: reverting =======
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0057s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"idx_container_repos_on_import_started_at_when_importing"})
   -> 0.0041s
-- execute("RESET statement_timeout")
   -> 0.0007s
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0035s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"idx_container_repos_on_pre_import_done_at_when_pre_import_done"})
   -> 0.0019s
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0029s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"idx_container_repos_on_pre_import_started_at_when_pre_importing"})
   -> 0.0016s
== 20220202115350 AddMigrationIndexesToContainerRepositories: reverted (0.0261s)
```