Guard worker: don't abort ongoing migrations
🌵 Context
We're currently migrating data in the Container Registry. Given the amount of data to migrate, the whole migration is driven by the rails backend. Basically, the rails backend is responsible for picking the next eligible container image for the migration and instructing the Container Registry to start the migration.
The migration being a two-step process, the rails backend has a state_machine associated with the migration status so that it can follow the steps that are completed.
Now, given that interactions between the rails backend and the Container Registry can be unreliable (network timeouts or emergency reboots), the migration statuses in the Container Registry and the rails backend can get out of sync.
In order to handle those situations, in !79634 (merged), we created a background job, called Guard, that constantly monitors the ongoing migrations and tries to fix stale ones. Right now, the fix is super simple: we abort the migration (and it will get retried at a later time).
In Bypass ongoing imports in the Guard worker (#352562 - closed), we suggested pinging the Container Registry for those stale migrations: if the Container Registry replies that the step is still ongoing and the status on the rails side is coherent with that, we don't abort the migration. This is to avoid useless ping pongs such as importing -> aborted -> importing.
This is what this MR updates.
🤔 What does this MR do and why?
- Update the Container Registry Migration Guard job
- Check the actual migration status for ongoing steps.
- If the step is still ongoing and the migration state on the rails side is coherent with that, do nothing.
- Detect long running (30min+) stale migrations.
- Check the actual migration status for ongoing steps.
- Update the related specs.
- Given that the Guard job was added in %14.8 and this is a follow-up fix, no changelog has been added.
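To make the decision described above easier to follow, here is a minimal sketch in plain Ruby. It is not the actual worker code: `guard_action`, `stale?`, `STALE_AFTER_SECONDS` and `COHERENT_REGISTRY_STATUS` are hypothetical names used for illustration only, and a single 30 minute threshold for every step is an assumption. The state and status strings are the ones used in the validation scenarios of this MR.

```ruby
# Hypothetical, simplified model of the Guard decision (not the real worker).
STALE_AFTER_SECONDS = 30 * 60 # assumption: one "30min+" threshold for all steps

# Registry status that confirms each ongoing rails-side migration state.
COHERENT_REGISTRY_STATUS = {
  'pre_importing' => 'pre_import_in_progress',
  'importing' => 'import_in_progress'
}.freeze

# A migration is considered stale once its current step started too long ago.
def stale?(started_at, now: Time.now)
  now - started_at > STALE_AFTER_SECONDS
end

# Returns :keep when the migration should be left alone, :abort otherwise.
def guard_action(migration_state:, started_at:, registry_status:, now: Time.now)
  # Migrations that are not stale are not touched at all.
  return :keep unless stale?(started_at, now: now)

  # Stale, but the registry confirms that the same step is still ongoing.
  expected = COHERENT_REGISTRY_STATUS[migration_state]
  return :keep if expected && expected == registry_status

  # Any other case: abort, the migration gets retried at a later time.
  :abort
end
```

For example, `guard_action(migration_state: 'importing', started_at: Time.now - 3600, registry_status: 'import_in_progress')` gives `:keep`, while the same call with `registry_status: 'error'` gives `:abort`.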
🖼 Screenshots or screen recordings
n / a
⚙ How to set up and validate locally
- ⚠ Given that the Container Registry side is still being worked on, we're going to modify ContainerRegistry::GitlabApiClient#import_status to directly return a possible import status. Example:
def import_status(path)
  'pre_import_in_progress'
  # body_hash = response_body(faraday.get(import_url_for(path)))
  # body_hash['status'] || 'error'
end
- Have some capacity in the migration queue:
Feature.enable(:container_registry_migration_phase2_capacity_25)
- Disable the .com? check done by the guard worker
1️⃣ Pre importing image and container registry replies pre_import_in_progress
- Let's create a container repository in the right migration status
image = FactoryBot.create(:container_repository, :pre_importing, project: Project.last)
- Let's make it stale
image.update!(migration_pre_import_started_at: 1.hour.ago)
- Run the guard job
ContainerRegistry::Migration::GuardWorker.new.perform
- Check the status
image.reload.migration_state # "pre_importing"
2️⃣ Pre importing image and the container registry replies anything else
- Let's create a container repository in the right migration status
image = FactoryBot.create(:container_repository, :pre_importing, project: Project.last)
- Let's make it stale
image.update!(migration_pre_import_started_at: 1.hour.ago)
- Run the guard job
ContainerRegistry::Migration::GuardWorker.new.perform
- Check the status
image.reload.migration_state # "import_aborted"
3️⃣ Importing image and container registry replies import_in_progress
- Let's create a container repository in the right migration status
image = FactoryBot.create(:container_repository, :importing, project: Project.last)
- Let's make it stale
image.update!(migration_import_started_at: 1.hour.ago)
- Run the guard job
ContainerRegistry::Migration::GuardWorker.new.perform
- Check the status
image.reload.migration_state # "importing"
4️⃣ Importing image and the container registry replies anything else
- Let's create a container repository in the right migration status
image = FactoryBot.create(:container_repository, :importing, project: Project.last)
- Let's make it stale
image.update!(migration_import_started_at: 1.hour.ago)
- Run the guard job
ContainerRegistry::Migration::GuardWorker.new.perform
- Check the status
image.reload.migration_state # "import_aborted"
5️⃣ Any other case
- Let's create a container repository in any other migration status
image = FactoryBot.create(:container_repository, :pre_import_done, project: Project.last)
- Let's make it stale
image.update!(migration_pre_import_done_at: 1.hour.ago)
- Run the guard job
ContainerRegistry::Migration::GuardWorker.new.perform
- Check the status
image.reload.migration_state # "import_aborted"
🏁 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.