Phase 2 enqueuer

Steve Abrams requested to merge 349744-phase2-enqueuer into master

🏛 Context

We are preparing for Phase 2 of the Container Registry migration which involves importing all existing container repositories to the new platform (Phase 1 involved routing all new container repositories to the new platform). See &7316 (closed) for full details of how the import will work.

Rails is responsible for starting each import. This introduces the EnqueuerWorker, which will query the container_repositories table, find the next repository that qualifies for import, and make a request to the registry to start the pre-import.

🔬 What does this MR do and why?

This MR introduces the EnqueuerWorker. It is responsible for finding the next container repository that qualifies for import and kicking off that import. It follows a sequence of checks:

  1. Return unless the main import feature flag :container_registry_migration_phase2_enabled is enabled.
  2. Return if there are too many container repositories currently being imported.
  3. Return if there has not been a long enough delay between imports (eventually this delay will drop to 0, but while we are starting out we want to go one at a time).
  4. Check if there are any imports that were aborted. If one is found, restart it and return.
  5. Find the next container repository that qualifies for import.
    • We are following a rollout plan where we import one pricing tier at a time, subject to a few additional rules.
  6. If the qualified repository has too many tags, skip it and return.
  7. Start the import for the qualified repository.
  8. If starting or retrying an import fails, abort the import so it can try again later.

A more detailed description of these steps can be found in the issue description.
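As a rough sketch, the worker's perform method chains these steps as guard clauses (method names here are illustrative, not the exact implementation):

    def perform
      return unless ContainerRegistry::Migration.enabled?  # 1. main feature flag
      return if max_capacity_reached?                      # 2. too many active imports
      return unless waiting_time_passed?                   # 3. delay between imports
      return if retry_aborted_migration                    # 4. restart an aborted import, if any

      repository = next_qualified_repository               # 5. next repository to import
      return unless repository
      return if skip_for_too_many_tags(repository)         # 6. too many tags, skip and return

      repository.start_pre_import                          # 7. kick off the import
    rescue StandardError
      repository&.abort_import                             # 8. abort so it can be retried later
    end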

The EnqueuerWorker will be kicked off in two ways:

  1. A cron running every hour will run the worker, starting an import
  2. Whenever an import is completed, the worker will be kicked off to start a new import

The cron ensures that imports will keep trying, especially while we are starting out and have everything throttled down using the various feature flags and application settings in ContainerRegistry::Migration.
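For context, a sketch of how the hourly cron could be wired up (the settings key name is an assumption):

    # config/initializers/1_settings.rb (sketch)
    Settings.cron_jobs['container_registry_migration_worker'] ||= Settingslogic.new({})
    Settings.cron_jobs['container_registry_migration_worker']['cron'] ||= '0 * * * *'
    Settings.cron_jobs['container_registry_migration_worker']['job_class'] = 'ContainerRegistry::Migration::EnqueuerWorker'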

There are many calls to methods in ::ContainerRegistry::Migration. These all check feature flag and application setting values. Since we use a fairly large number of settings and feature flags to control the import rollout, they have been centralized in a single class to keep things organized.
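As a minimal sketch of that class (return values and any method names not mentioned elsewhere in this MR are assumptions):

    module ContainerRegistry
      module Migration
        def self.enabled?
          Feature.enabled?(:container_registry_migration_phase2_enabled)
        end

        def self.capacity
          # each capacity flag unlocks a higher number of parallel imports
          return 1 if Feature.enabled?(:container_registry_migration_phase2_capacity_1)

          0
        end

        def self.enqueue_waiting_time
          return 0 if Feature.enabled?(:container_registry_migration_phase2_enqueue_speed_fast)
          return 6.hours if Feature.enabled?(:container_registry_migration_phase2_enqueue_speed_slow)

          1.hour
        end
      end
    end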

🐘 Database

Queries

This MR introduces 4 new scopes that in turn make up 4 new queries:

1. ContainerRepository.with_migration_states(%w[pre_importing pre_import_done importing]).count

Query:
SELECT COUNT(*) 
FROM "container_repositories" 
WHERE "container_repositories"."migration_state" 
   IN ('pre_importing', 'pre_import_done', 'importing');

Explain:

Note: currently all container repositories have a 'default' migration_state, so even after adding the index and updating some values on postgres.ai, we cannot achieve a cold-cache query. Beyond the better explain plan with the new index, the thing to notice for this and all of the queries using the new index is that the total number of buffers (hits + reads) is much lower.
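The scope itself is a simple where clause; a sketch:

    # app/models/container_repository.rb
    scope :with_migration_states, ->(states) { where(migration_state: states) }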

2. ContainerRepository.recently_done_migration_step.first

This query uses a new index, index_container_repositories_on_greatest_done_at.

Query:
SELECT "container_repositories".* 
FROM "container_repositories" 
WHERE "container_repositories"."migration_state" IN ('import_done', 'pre_import_done', 'import_aborted') 
ORDER BY GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at) DESC 
LIMIT 1;

To set up some data for this query in postgres.ai:

UPDATE container_repositories SET migration_state = 'import_done',
migration_import_done_at = (
  select timestamp '2020-01-10 00:00:00' + random() * (timestamp '2022-01-01 00:00:00' - timestamp '2020-01-01 00:00:00')
) WHERE id % 100 = 0;

UPDATE container_repositories SET migration_state = 'pre_import_done',
migration_pre_import_done_at = (
  select timestamp '2020-01-10 00:00:00' + random() * (timestamp '2022-01-01 00:00:00' - timestamp '2020-01-01 00:00:00')
) WHERE id % 425 = 0;

UPDATE container_repositories SET migration_state = 'import_aborted',
migration_aborted_at = (
  select timestamp '2020-01-10 00:00:00' + random() * (timestamp '2022-01-01 00:00:00' - timestamp '2020-01-01 00:00:00')
) WHERE id % 900 = 0;

Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8442/commands/29887
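A sketch of how this scope can be expressed in the model (Arel.sql wraps the raw GREATEST expression so Rails accepts it in order):

    scope :recently_done_migration_step, -> {
      where(migration_state: %w[import_done pre_import_done import_aborted])
        .order(Arel.sql('GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at) DESC'))
    }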

3. ContainerRepository.with_migration_state('import_aborted').take

Query:
SELECT "container_repositories".* 
FROM "container_repositories" 
WHERE "container_repositories"."migration_state" = 'import_aborted' 
LIMIT 1
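This is the singular counterpart of the scope from query 1; a sketch:

    scope :with_migration_state, ->(state) { where(migration_state: state) }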

4. ContainerRepository.ready_for_import.take


The .ready_for_import scope contains .with_target_import_tier, which is overridden in EE, and there is additionally a feature flag that can affect how the query is formed. Performance benefits greatly since we have no ORDER BY and only need one record (LIMIT 1).

Note that the EE permutations have a guard so they will only execute on GitLab.com.
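To make the permutations easier to follow, here is a sketch of how the scope might compose (the created_before helper is an assumption, and the deny-list NOT EXISTS clause is omitted for brevity):

    scope :ready_for_import, -> {
      with_migration_state('default')
        .where('container_repositories.created_at < ?', ContainerRegistry::Migration.created_before)
        .with_target_import_tier
    }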

Here is each permutation with notes about when and how often they will be used:

EE - .with_target_import_tier filters by plan name

This query occurs when the feature flag :container_registry_migration_limit_gitlab_org is disabled. It is the most complicated query (the most joins and filters) and the one that will be used for the majority of the GitLab.com migration.

This query might benefit from an index since, over time, the migration_state of container repositories will move from default to import_done, but I didn't want to add the index prematurely. I'm open to looking further into it if there are any specific ideas.
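Roughly, the EE override branches on the feature flag; a sketch with the join chain inferred from the SQL in this and the next permutation (association names are assumptions):

    # ee/app/models/ee/container_repository.rb (sketch)
    scope :with_target_import_tier, -> {
      if Feature.enabled?(:container_registry_migration_limit_gitlab_org)
        joins(project: :namespace).where(namespaces: { path: 'gitlab-org' })
      else
        joins(project: { namespace: { gitlab_subscription: :hosted_plan } })
          .where(plans: { name: 'free' })
      end
    }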

SELECT "container_repositories".* 
FROM "container_repositories" 
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" 
INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" 
INNER JOIN "gitlab_subscriptions" ON "gitlab_subscriptions"."namespace_id" = "namespaces"."id" 
INNER JOIN "plans" ON "plans"."id" = "gitlab_subscriptions"."hosted_plan_id" 
WHERE "container_repositories"."migration_state" = 'default' 
AND "container_repositories"."created_at" < '2022-01-01 00:00:00' 
AND "plans"."name" = 'free' 
AND (
  NOT EXISTS (
        SELECT 1
        FROM feature_gates
        WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
        AND feature_gates.key = 'actors'
        AND feature_gates.value = concat('Group:', projects.namespace_id)
  )
) LIMIT 1;

Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8310/commands/29375

EE - .with_target_import_tier filters repositories for `gitlab-org` group

This query occurs when the feature flag :container_registry_migration_limit_gitlab_org is enabled. This will be used to allow us to start by only importing container repositories belonging to the gitlab-org group.

SELECT "container_repositories".* 
FROM "container_repositories" 
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" 
INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" 
WHERE "container_repositories"."migration_state" = 'default' 
AND "container_repositories"."created_at" < '2022-01-01 00:00:00' 
AND "namespaces"."path" = 'gitlab-org' 
AND (
  NOT EXISTS (
        SELECT 1
        FROM feature_gates
        WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
        AND feature_gates.key = 'actors'
        AND feature_gates.value = concat('Group:', projects.namespace_id)
  )
) LIMIT 1

Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8411/commands/29674

FOSS - `.with_target_import_tier` returns `all`

This is the least complicated query (the fewest joins and filters). This is what will run on self-managed instances that use the import process.

SELECT "container_repositories".* 
FROM "container_repositories" 
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" 
WHERE "container_repositories"."migration_state" = 'default' 
AND "container_repositories"."created_at" < '2022-01-23 00:00:00' 
AND (
  NOT EXISTS (
        SELECT 1
        FROM feature_gates
        WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
        AND feature_gates.key = 'actors'
        AND feature_gates.value = concat('Group:', projects.namespace_id)
  )
) LIMIT 1;

Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8306/commands/29333

Migrations

Migration output
 bundle exec rails db:migrate:redo
== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: reverting
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0052s
-- execute("SET statement_timeout TO 0")
   -> 0.0007s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"index_container_repositories_on_migration_state_import_done_at"})
   -> 0.0064s
-- execute("RESET statement_timeout")
   -> 0.0007s
== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: reverted (0.0154s)

== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: migrating
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, [:migration_state, :migration_import_done_at], {:name=>"index_container_repositories_on_migration_state_import_done_at", :algorithm=>:concurrently})
   -> 0.0068s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- add_index(:container_repositories, [:migration_state, :migration_import_done_at], {:name=>"index_container_repositories_on_migration_state_import_done_at", :algorithm=>:concurrently})
   -> 0.0082s
-- execute("RESET statement_timeout")
   -> 0.0011s
== 20220128194722 AddIndexOnMigrationStateAndImportDoneAtToContainerRepositories: migrated (0.0211s)
 bundle exec rails db:migrate:redo
== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: reverting ==
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0057s
-- execute("SET statement_timeout TO 0")
   -> 0.0009s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"index_container_repositories_on_greatest_done_at"})
   -> 0.0066s
-- execute("RESET statement_timeout")
   -> 0.0006s
== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: reverted (0.0185s)

== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: migrating ==
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted')", :name=>"index_container_repositories_on_greatest_done_at", :algorithm=>:concurrently})
   -> 0.0061s
-- execute("SET statement_timeout TO 0")
   -> 0.0006s
-- add_index(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted')", :name=>"index_container_repositories_on_greatest_done_at", :algorithm=>:concurrently})
   -> 0.0153s
-- execute("RESET statement_timeout")
   -> 0.0007s
== 20220204154220 AddIndexOnGreatestDoneAtToContainerRepositories: migrated (0.0321s)
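For reference, the second index is a functional, partial index; a sketch of the migration following the usual add_concurrent_index pattern:

    class AddIndexOnGreatestDoneAtToContainerRepositories < Gitlab::Database::Migration[1.0]
      disable_ddl_transaction!

      INDEX_NAME = 'index_container_repositories_on_greatest_done_at'

      def up
        add_concurrent_index :container_repositories,
          'GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)',
          where: "migration_state IN ('import_done', 'pre_import_done', 'import_aborted')",
          name: INDEX_NAME
      end

      def down
        remove_concurrent_index_by_name :container_repositories, INDEX_NAME
      end
    end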

📸 Screenshots or screen recordings

See below

💻 How to set up and validate locally

We cannot fully test this end-to-end because the import functionality is still being developed in the Container Registry, so any request to import a repository will result in an error. It does mean, however, that we can test that imports are aborted properly and that the worker follows the application settings and feature flags in place.

  1. Set up the feature flags:

    Feature.enable(:container_registry_migration_phase2_enabled)
    
    Feature.enable(:container_registry_migration_phase2_capacity_1)
    
    Feature.disable(:container_registry_migration_phase2_enqueue_speed_fast)
    Feature.disable(:container_registry_migration_phase2_enqueue_speed_slow)
    
  2. Create some container repositories in the console and set them to be created a few months ago so they qualify for import:

    10.times { FactoryBot.create(:container_repository, project: Project.first) }
    ContainerRepository.update_all(created_at: 3.months.ago)
    ContainerRepository.where(migration_state: 'default').count # => 10
  3. Run the worker

    ContainerRegistry::Migration::EnqueuerWorker.set(queue: 'cronjob:container_registry_migration_enqueuer').perform_in(1.second)
  4. Check the container repositories; the first one should have been aborted:

    ContainerRepository.where(migration_state: 'default').count # => 9
    ContainerRepository.where.not(migration_state: 'default').first.migration_state
    # => "import_aborted"
    # Since the registry cannot be connected to in these tests, we receive an error and the import is aborted
  5. Set the first repository as recently imported:

    ContainerRepository.first.update(migration_state: 'import_done', migration_import_done_at: 5.minutes.ago)
  6. Rerun the worker and see that no repositories are updated:

    ContainerRegistry::Migration::EnqueuerWorker.set(queue: 'cronjob:container_registry_migration_enqueuer').perform_in(1.second)
    ContainerRepository.where(migration_state: 'default').count # => 9
  7. Update the waiting time feature flag:

    Feature.enable(:container_registry_migration_phase2_enqueue_speed_fast)
  8. Rerun the worker and see that another repository has been updated:

    ContainerRegistry::Migration::EnqueuerWorker.set(queue: 'cronjob:container_registry_migration_enqueuer').perform_in(1.second)
    ContainerRepository.where(migration_state: 'default').count # => 8

📐 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #349744 (closed)
