Update EnqueuerWorker to handle nil time checks (!84315) · Merge requests · GitLab.org / GitLab

🏛 Context

We are preparing to migrate all existing container repositories to the new container registry. This process is driven by rails.

The main driver of the process is the EnqueuerWorker, which finds the next container repository qualified for import, and then sends an API request to the registry to start the import.

We have a variety of settings and feature flags in place so we can control which container repositories are being imported, and how fast they are being imported. One such setting allows us to say how quickly we should start a new import after a previous one has been completed. We use a few feature flags to say "start a new import if it has been x time since the last one finished" where x can be 1 hour, 6 hours, or 0 which would be continuous imports.

There is a bug where we might look at the last completed import and the timestamp we use to see if we should import or return early could be nil. This leads to a loop where we return early, then the EnqueuerWorker runs again, and returns early again and so on, never starting the next import.

💡 What does this MR do and why?

This MR updates a few places to prevent the nil timestamps from preventing imports moving forward. We:

Update the EnqueuerWorker to check for the presence of last_import_step_done_at. If it is nil, we allow the Enqueuer to continue.
Update last_import_step_done_at to also include import_skipped_at when looking for the most recently completed import. Skipped imports have completed their path so we can start the next import after it has completed.
To ensure we don't prevent timestamps from properly updating, update all before_transitions in the migration_state state machine so any state can be the starting state. In testing we have seen cases where the incorrect state was the starting state and this caused the timestamp to be nil.

I have changed the recently_done_migration_step scope, which is a query change, but it only changes the underlying query condition, not the query plan itself. Thus I have opted not to request a database review. Please feel free to request one if you feel it should be requested.

🐘 Database

Query

The query that uses the updated scope recently_done_migration_step is:

ContainerRepository.recently_done_migration_step.first

SELECT "container_repositories".* 
  FROM "container_repositories" 
WHERE "container_repositories"."migration_state" 
   IN ('import_done', 'pre_import_done', 'import_aborted', 'import_skipped') 
ORDER BY GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at, migration_skipped_at) DESC LIMIT 1;

We have an index covering this query that has been updated. All records on production have a migration_state of 'default', so we have to do some setup on postgres.ai to show the query will still properly use the index:

Setup:

-- Create new index
exec CREATE INDEX index_container_repositories_on_greatest_completed_at ON container_repositories USING btree (GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at, migration_skipped_at)) WHERE (migration_state = ANY (ARRAY['import_done'::text, 'pre_import_done'::text, 'import_aborted'::text, 'import_skipped'::text]));

-- Update some records to include the timestamps and states used in the query
exec update container_repositories set migration_import_done_at = NOW(), migration_state = 'import_done' where id in (111, 2222, 3434, 51152, 397573, 427209);

exec update container_repositories set migration_skipped_at = NOW(), migration_state = 'import_skipped' where id in (123, 1234, 4444, 51252, 398573, 227209);

Explain Plan: 3.133 ms https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9667/commands/34294

This is a warm cache explain plan. We are unable to get a cold cache plan because we are adding the index and updating the data right before executing the query. However, the explain plan looks good with and Index Scan, and we see a very low number of buffers which would be similar even with a cold cache query.

Migration

DB testing output - migration time - 4.1s

Up 🛫

== 20220408135815 UpdateIndexOnGreatedDoneAtOnContainerRepositories: migrating
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at, migration_skipped_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted', 'import_skipped')", :name=>"index_container_repositories_on_greatest_completed_at", :algorithm=>:concurrently})
   -> 0.0077s
-- execute("SET statement_timeout TO 0")
   -> 0.0007s
-- add_index(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at, migration_skipped_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted', 'import_skipped')", :name=>"index_container_repositories_on_greatest_completed_at", :algorithm=>:concurrently})
   -> 0.0090s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0049s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"index_container_repositories_on_greatest_done_at"})
   -> 0.0054s
== 20220408135815 UpdateIndexOnGreatedDoneAtOnContainerRepositories: migrated (0.0413s)

Down 🛬

== 20220408135815 UpdateIndexOnGreatedDoneAtOnContainerRepositories: reverting
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted')", :name=>"index_container_repositories_on_greatest_done_at", :algorithm=>:concurrently})
   -> 0.0228s
-- execute("SET statement_timeout TO 0")
   -> 0.0022s
-- add_index(:container_repositories, "GREATEST(migration_pre_import_done_at, migration_import_done_at, migration_aborted_at)", {:where=>"migration_state IN ('import_done', 'pre_import_done', 'import_aborted')", :name=>"index_container_repositories_on_greatest_done_at", :algorithm=>:concurrently})
   -> 0.0162s
-- execute("RESET statement_timeout")
   -> 0.0019s
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_repositories)
   -> 0.0206s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"index_container_repositories_on_greatest_completed_at"})
   -> 0.0128s
== 20220408135815 UpdateIndexOnGreatedDoneAtOnContainerRepositories: reverted (0.1088s)

📸 Screenshots or screen recordings

See below

💻 How to set up and validate locally

Enable the necessary feature flags:

Feature.enable(:container_registry_migration_phase2_enabled)
Feature.enable(:container_registry_migration_phase2_enqueue_speed_fast)
Feature.enable(:container_registry_migration_phase2_all_plans)

Create two container repositories. The first will be the next repository to be handled by the Enqueuer, the second will be the last completed repository. Note the ID of the first created container repository so we can check on it later.

FactoryBot.create(:container_repository, project: Project.last, created_at: '2022-01-01') # note the ID of this container repository
FactoryBot.create(:container_repository, :import_done, project: Project.last, created_at: '2022-01-01')

Optional: to be sure that no other previously container repositories conflict when running the worker below, you may want to remove all previous container repositories:

ContainerRepository.where('id < {the ID you noted above}').destroy_all

Update the last completed repository to have nil timestamps:

ContainerRepository.last.update_columns(migration_pre_import_done_at: nil, migration_import_done_at: nil)

Update the #try_import method in app/models/container_repository.rb to return early and skip the registry API request. If you don't do this, you will have to wait about 5 minutes for the job to run because it will continue to try to make requests to the registry.

  def try_import
    abort_import
    return false

    ...
  end

Run the worker:

ContainerRegistry::Migration::EnqueuerWorker.perform_at(1.second)

Check the first repository you created. It should have been processed by the worker and now be in the aborted state:

ContainerRepository.find(<ID that was noted above>).migration_state
=> 'import_aborted'

✏ MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #357498 (closed)

Edited Apr 08, 2022 by Steve Abrams

Update EnqueuerWorker to handle nil time checks