Phase 2 enqueuer
🏛 Context
We are preparing for Phase 2 of the Container Registry migration which involves importing all existing container repositories to the new platform (Phase 1 involved routing all new container repositories to the new platform). See &7316 (closed) for full details of how the import will work.
Rails is responsible for starting each import. This introduces the `EnqueuerWorker`, which will query the `container_repositories` table, find the next repository that qualifies for import, and make a request to the registry to start the pre-import.
🔬 What does this MR do and why?
This MR introduces the `EnqueuerWorker`. It is responsible for finding the next container repository that qualifies for import and kicking off that import. It follows a sequence of checks:
- Return unless the main import feature flag `:container_registry_migration_phase2_enabled` is enabled.
- Return if there are too many container repositories currently being imported.
- Return if there has not been a long enough delay between imports (eventually this delay will move to 0, but while we are starting out we want to go one at a time).
- Check if there are any imports that were aborted. If one is found, restart it and return.
- Find the next container repository that qualifies for import.
  - We are following a rollout plan where we import one pricing tier at a time, with a few other rules.
- If the qualified repository has too many tags, skip it and return.
- Start the import for the qualified repository.
- If starting or retrying an import fails, abort the import so it can be retried later.
A more detailed description of these steps can be found in the issue description.
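The sequence of checks above can be sketched as plain Ruby. This is an illustrative simulation with stubbed predicates, not the actual worker code; every name below is hypothetical:

```ruby
# Illustrative simulation of the EnqueuerWorker guard sequence.
# Each predicate stands in for a real feature-flag, setting, or
# database check; none of these names are the actual GitLab API.
class EnqueuerSketch
  def initialize(enabled:, below_capacity:, waited_long_enough:,
                 aborted_import: nil, next_import: nil, too_many_tags: false)
    @enabled = enabled
    @below_capacity = below_capacity
    @waited_long_enough = waited_long_enough
    @aborted_import = aborted_import
    @next_import = next_import
    @too_many_tags = too_many_tags
  end

  # Mirrors the ordered guards: each early return corresponds to one
  # bullet in the list above.
  def perform
    return :feature_disabled unless @enabled
    return :at_capacity unless @below_capacity
    return :too_soon unless @waited_long_enough
    return :restarted_aborted_import if @aborted_import
    return :nothing_to_import unless @next_import
    return :skipped_too_many_tags if @too_many_tags

    :import_started
  end
end
```

For example, `EnqueuerSketch.new(enabled: true, below_capacity: true, waited_long_enough: true, next_import: :repo).perform` returns `:import_started`, while flipping `enabled: false` short-circuits at the first guard.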
The `EnqueuerWorker` will be kicked off in two ways:
- A cron running every hour will run the worker, starting an import
- Whenever an import is completed, the worker will be kicked off to start a new import
The cron ensures that imports will keep trying, especially while we are starting out and have everything throttled down using the various feature flags and application settings in `ContainerRegistry::Migration`.
There are many calls to methods in `::ContainerRegistry::Migration`. These all check feature flag and application setting values. Since we are using a fairly large number of settings and feature flags to control the import rollout, they have been centralized in a single class to keep things organized.
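As a sketch of that centralization pattern (the module, method names, and values below are illustrative assumptions, not the actual `ContainerRegistry::Migration` code):

```ruby
# Sketch: one module owns every flag/setting read so worker code never
# queries Feature or ApplicationSetting directly. The hashes are
# hard-coded stand-ins for the real feature flags and settings.
module MigrationSettingsSketch
  FLAGS = { phase2_enabled: true }.freeze
  SETTINGS = { capacity: 1, enqueue_delay_seconds: 3600 }.freeze

  def self.enabled?
    FLAGS.fetch(:phase2_enabled, false)
  end

  def self.capacity
    SETTINGS.fetch(:capacity, 0)
  end

  def self.enqueue_waiting_time
    SETTINGS.fetch(:enqueue_delay_seconds, 0)
  end
end
```

Callers then ask `MigrationSettingsSketch.capacity` rather than reading flags and settings inline, which keeps all of the rollout knobs discoverable in one place.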
🐘 Database
Queries
This MR introduces 4 new scopes that in turn make up 4 new queries:
1. `ContainerRepository.with_migration_states(%w[pre_importing pre_import_done importing]).count`

Query:

```sql
SELECT COUNT(*)
FROM "container_repositories"
WHERE "container_repositories"."migration_state" IN ('pre_importing', 'pre_import_done', 'importing');
```
Explain:

- Without new index - Seq Scan 👎 : https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8309/commands/29359
- With new index - Index only scan 👍 : https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8306/commands/29357
Note: currently all container repositories have a `'default'` `migration_state`, so having added the index and updated some values on postgres.ai, we cannot achieve a cold-cache query. Beyond the better explain plan with the new index, the thing to notice for this and all of the queries using the new index is that the total number of buffers (hits + reads) is much lower.
2. `ContainerRepository.recently_done_migration_step.first`

This query uses a new index, `index_container_repositories_on_greatest_done_at`.

Query:

```sql
SELECT "container_repositories".*
FROM "container_repositories"
WHERE "container_repositories"."migration_state" IN ('import_done', 'pre_import_done', 'import_aborted')
ORDER BY GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at) DESC
LIMIT 1;
```
To set up some data for this query in postgres.ai:

```sql
UPDATE container_repositories SET migration_state = 'import_done',
migration_import_done_at = (
  select timestamp '2020-01-10 00:00:00' + random() * (timestamp '2022-01-01 00:00:00' - timestamp '2020-01-01 00:00:00')
) WHERE id % 100 = 0;

UPDATE container_repositories SET migration_state = 'pre_import_done',
migration_pre_import_done_at = (
  select timestamp '2020-01-10 00:00:00' + random() * (timestamp '2022-01-01 00:00:00' - timestamp '2020-01-01 00:00:00')
) WHERE id % 425 = 0;

UPDATE container_repositories SET migration_state = 'import_aborted',
migration_aborted_at = (
  select timestamp '2020-01-10 00:00:00' + random() * (timestamp '2022-01-01 00:00:00' - timestamp '2020-01-01 00:00:00')
) WHERE id % 900 = 0;
```
Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8442/commands/29887
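In PostgreSQL, `GREATEST` ignores `NULL` arguments, so the `ORDER BY ... DESC LIMIT 1` picks the row whose most recent non-null done/aborted timestamp is latest. A pure-Ruby analogue of that ordering (the data is made up for illustration):

```ruby
# Each hash stands in for a container_repositories row; nil mirrors a
# NULL *_done_at / *_aborted_at column.
rows = [
  { id: 1, pre_import_done_at: Time.utc(2022, 1, 1),  import_done_at: nil,                  aborted_at: nil },
  { id: 2, pre_import_done_at: nil,                   import_done_at: Time.utc(2022, 2, 1), aborted_at: nil },
  { id: 3, pre_import_done_at: nil,                   import_done_at: nil,                  aborted_at: Time.utc(2022, 1, 15) }
]

# Equivalent of ORDER BY GREATEST(...) DESC LIMIT 1: compact drops the
# nils (as Postgres GREATEST ignores NULLs), max takes the latest.
most_recent = rows.max_by { |r| r.values_at(:pre_import_done_at, :import_done_at, :aborted_at).compact.max }
most_recent[:id] # => 2
```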
3. `ContainerRepository.with_migration_state('import_aborted').take`

Query:

```sql
SELECT "container_repositories".*
FROM "container_repositories"
WHERE "container_repositories"."migration_state" = 'import_aborted'
LIMIT 1
```
4. `ContainerRepository.ready_for_import.take`

Explain:

- Without new index - Seq scan 👎 : https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8309/commands/29366
- With new index - Index scan 👍 : https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8309/commands/29370
The `.ready_for_import` scope contains `.with_target_import_tier`, which is overridden in EE, and there is additionally a feature flag that can affect how the query is formed. Performance benefits greatly since we have no `ORDER BY` and only need one record (`LIMIT 1`).
Note that the EE permutations have a guard so they will only execute on GitLab.com.
Here is each permutation with notes about when and how often they will be used:
EE - `.with_target_import_tier` filters by plan name

This query occurs when the feature flag `:container_registry_migration_limit_gitlab_org` is disabled. This is the most complicated query (most joins and filters). This is the query that will be used the majority of the time for the GitLab.com migration.
This query might benefit from an index since, over time, the `migration_state` of the container repositories will move from `default` to `import_done`, but I didn't want to prematurely add the index. I'm open to looking further into it if there are any specific ideas.
```sql
SELECT "container_repositories".*
FROM "container_repositories"
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id"
INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id"
INNER JOIN "gitlab_subscriptions" ON "gitlab_subscriptions"."namespace_id" = "namespaces"."id"
INNER JOIN "plans" ON "plans"."id" = "gitlab_subscriptions"."hosted_plan_id"
WHERE "container_repositories"."migration_state" = 'default'
AND "container_repositories"."created_at" < '2022-01-01 00:00:00'
AND "plans"."name" = 'free'
AND (
  NOT EXISTS (
    SELECT 1
    FROM feature_gates
    WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
    AND feature_gates.key = 'actors'
    AND feature_gates.value = concat('Group:', projects.namespace_id)
  )
) LIMIT 1;
```
Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8310/commands/29375
EE - `.with_target_import_tier` filters repositories for the `gitlab-org` group

This query occurs when the feature flag `:container_registry_migration_limit_gitlab_org` is enabled.

This will be used to allow us to start by only importing container repositories belonging to the `gitlab-org` group.
```sql
SELECT "container_repositories".*
FROM "container_repositories"
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id"
INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id"
WHERE "container_repositories"."migration_state" = 'default'
AND "container_repositories"."created_at" < '2022-01-01 00:00:00'
AND "namespaces"."path" = 'gitlab-org'
AND (
  NOT EXISTS (
    SELECT 1
    FROM feature_gates
    WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
    AND feature_gates.key = 'actors'
    AND feature_gates.value = concat('Group:', projects.namespace_id)
  )
) LIMIT 1
```
Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8411/commands/29674
FOSS - `.with_target_import_tier` returns `all`

This is the least complicated query (fewest joins and filters). This is what will be run on self-managed instances that use the import process.
```sql
SELECT "container_repositories".*
FROM "container_repositories"
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id"
WHERE "container_repositories"."migration_state" = 'default'
AND "container_repositories"."created_at" < '2022-01-23 00:00:00'
AND (
  NOT EXISTS (
    SELECT 1
    FROM feature_gates
    WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
    AND feature_gates.key = 'actors'
    AND feature_gates.value = concat('Group:', projects.namespace_id)
  )
) LIMIT 1;
```
Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8306/commands/29333
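All three permutations share the `NOT EXISTS` deny-list subquery, which skips any repository whose group is stored as a `Group:<namespace_id>` actor on the `:container_registry_phase_2_deny_list` feature flag. In plain Ruby the filter amounts to the following (all data here is made up for illustration):

```ruby
# Stand-in for the feature_gates rows of the deny-list flag: the actor
# values stored for :container_registry_phase_2_deny_list.
deny_list_actors = ['Group:42']

# Stand-in repositories; each knows its project's namespace_id.
repositories = [
  { id: 1, namespace_id: 42 },
  { id: 2, namespace_id: 7 }
]

# Equivalent of the NOT EXISTS subquery: keep repositories whose
# "Group:<namespace_id>" actor is absent from the deny list.
eligible = repositories.reject { |repo| deny_list_actors.include?("Group:#{repo[:namespace_id]}") }
eligible.map { |r| r[:id] } # => [2]
```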
Migrations
Migration output
```shell
→ bundle exec rails db:migrate:redo
== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: reverting
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0052s
-- execute("SET statement_timeout TO 0")
   -> 0.0007s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"index_container_repositories_on_migration_state_import_done_at"})
   -> 0.0064s
-- execute("RESET statement_timeout")
   -> 0.0007s
== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: reverted (0.0154s)

== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: migrating
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_state, :migration_import_done_at], {:name=>"index_container_repositories_on_migration_state_import_done_at", :algorithm=>:concurrently})
   -> 0.0068s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- add_index(:container_repositories, [:migration_state, :migration_import_done_at], {:name=>"index_container_repositories_on_migration_state_import_done_at", :algorithm=>:concurrently})
   -> 0.0082s
-- execute("RESET statement_timeout")
   -> 0.0011s
== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: migrated (0.0211s)

→ bundle exec rake db:redo
== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: reverting ==
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0057s
-- execute("SET statement_timeout TO 0")
   -> 0.0009s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"index_container_repositories_on_greatest_done_at"})
   -> 0.0066s
-- execute("RESET statement_timeout")
   -> 0.0006s
== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: reverted (0.0185s)

== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: migrating ==
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted')", :name=>"index_container_repositories_on_greatest_done_at", :algorithm=>:concurrently})
   -> 0.0061s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- add_index(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted')", :name=>"index_container_repositories_on_greatest_done_at", :algorithm=>:concurrently})
   -> 0.0153s
-- execute("RESET statement_timeout")
   -> 0.0007s
== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: migrated (0.0321s)
```
📸 Screenshots or screen recordings
See below
💻 How to set up and validate locally
We cannot fully test this end to end because the import functionality is still being developed in the Container Registry, so any request to import a repository will result in an error. This does mean, however, that we can test that imports are aborted properly and that the worker follows the application settings and feature flags in place.
- Set up the feature flags:

  ```ruby
  Feature.enable(:container_registry_migration_phase2_enabled)
  Feature.enable(:container_registry_migration_phase2_capacity_1)
  Feature.disable(:container_registry_migration_phase2_enqueue_speed_fast)
  Feature.disable(:container_registry_migration_phase2_enqueue_speed_slow)
  ```

- Create some container repositories in the console and set them to be created a few months ago so they qualify for import:

  ```ruby
  10.times { FactoryBot.create(:container_repository, project: Project.first) }
  ContainerRepository.update_all(created_at: 3.months.ago)
  ContainerRepository.where(migration_state: 'default').count # => 10
  ```

- Run the worker:

  ```ruby
  ContainerRegistry::Migration::EnqueuerWorker.set(queue: 'cronjob:container_registry_migration_enqueuer').perform_in(1.second)
  ```

- Check the container repositories; the first one should have been aborted:

  ```ruby
  ContainerRepository.where(migration_state: 'default').count # => 9
  ContainerRepository.where.not(migration_state: 'default').first.migration_state # => "import_aborted"
  # Since the registry cannot be connected to in these tests, we receive an error and the import is aborted
  ```

- Set the first repository as recently imported:

  ```ruby
  ContainerRepository.first.update(migration_state: 'import_done', migration_import_done_at: 5.minutes.ago)
  ```

- Rerun the worker and see that no repositories are updated:

  ```ruby
  ContainerRegistry::Migration::EnqueuerWorker.set(queue: 'cronjob:container_registry_migration_enqueuer').perform_in(1.second)
  ContainerRepository.where(migration_state: 'default').count # => 9
  ```

- Update the waiting time feature flag:

  ```ruby
  Feature.enable(:container_registry_migration_phase2_enqueue_speed_fast)
  ```

- Rerun the worker and see another repository has been updated:

  ```ruby
  ContainerRegistry::Migration::EnqueuerWorker.set(queue: 'cronjob:container_registry_migration_enqueuer').perform_in(1.second)
  ContainerRepository.where(migration_state: 'default').count # => 8
  ```
📐 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #349744 (closed)