
Guard worker: don't abort ongoing migrations

David Fernandez requested to merge 352562-dont-abort-ongoing-migrations into master

🌵 Context

We're currently migrating data in the Container Registry. Given the amount of data to migrate, the whole migration is driven by the rails backend: it is responsible for picking the next eligible container image and instructing the Container Registry to start migrating it.

Because the migration is a two-step process, the rails backend has a state_machine associated with the migration status so that it can track which steps have been completed.

Now, given that interactions between the rails backend and the Container Registry aren't always reliable (network timeouts or emergency reboots), the migration statuses in the Container Registry and the rails backend can get out of sync.

In order to handle those situations, in !79634 (merged), we created a background job, called Guard, that constantly monitors the ongoing migrations and tries to fix stale ones. Right now the fix is super simple: we abort the migration (and it gets retried at a later time).

In Bypass ongoing imports in the Guard worker (#352562 - closed), we suggested pinging the Container Registry for those stale migrations: if the Container Registry replies that the step is still ongoing and the status on the rails side is coherent with that, we don't abort the migration. This avoids useless ping-pongs such as importing -> aborted -> importing.

This is what this MR updates.

🤔 What does this MR do and why?

  • Update the Container Registry Migration Guard job
    • Check the actual migration status for ongoing steps.
      • If the step is still ongoing and the migration state on the rails side is coherent with that, do nothing.
    • Detect long-running (30min+) stale migrations.
  • Update the related specs.
  • Given that the Guard job was added in %14.8 and this is a follow-up fix, no changelog entry has been added.
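
The "still ongoing and coherent" check described in the list can be sketched as a small pure function. This is a minimal illustration, not the code in this MR: `guard_action`, `ONGOING_REGISTRY_STATUS`, and the `:keep` / `:abort` return values are hypothetical names. It assumes that "coherent" means `pre_importing` pairs with `pre_import_in_progress` and `importing` pairs with `import_in_progress`, which matches the validation scenarios in this description.

```ruby
# Hypothetical mapping (an assumption for this sketch): the registry status
# considered coherent with each ongoing rails-side migration state.
ONGOING_REGISTRY_STATUS = {
  'pre_importing' => 'pre_import_in_progress',
  'importing' => 'import_in_progress'
}.freeze

# Decide what to do with a stale migration.
# Returns :keep when the registry confirms the step is still running,
# :abort in every other case.
def guard_action(migration_state, registry_status)
  expected = ONGOING_REGISTRY_STATUS[migration_state]
  return :abort if expected.nil? # not an ongoing step, nothing to confirm

  registry_status == expected ? :keep : :abort
end

puts guard_action('importing', 'import_in_progress')
puts guard_action('importing', 'error')
```

Keeping the decision in a lookup table means any state without an entry falls through to the existing abort behavior.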

🖼 Screenshots or screen recordings

n / a

How to set up and validate locally

  1. Given that the Container Registry side is still being worked on, we're going to modify ContainerRegistry::GitlabApiClient#import_status to directly return one of the possible import statuses. Example:
    def import_status(path)
      'pre_import_in_progress'
      # body_hash = response_body(faraday.get(import_url_for(path)))
      # body_hash['status'] || 'error'
    end
  2. Have some capacity in the migration queue: Feature.enable(:container_registry_migration_phase2_capacity_25)
  3. Disable the .com? check done by the guard worker
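
As an alternative way to think about step 1, the hardcoded return value can be pictured as a tiny stand-in client whose reply is configurable, so each scenario only changes one argument. This is a plain-Ruby sketch for illustration only: `FakeRegistryClient` is a hypothetical class, not part of GitLab, and wiring it into the guard worker is left out.

```ruby
# Hypothetical stand-in for the registry client, for local experiments only.
class FakeRegistryClient
  def initialize(status)
    @status = status
  end

  # Mirrors the method edited in step 1: takes a repository path and
  # returns a status string. The path is ignored in this sketch.
  def import_status(_path)
    @status
  end
end

client = FakeRegistryClient.new('pre_import_in_progress')
puts client.import_status('group/project/image')
```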

1️⃣ Pre importing image and container registry replies pre_import_in_progress

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :pre_importing, project: Project.last)
  2. Let's make it stale
    image.update!(migration_pre_import_started_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "pre_importing"

2️⃣ Pre importing image and the container registry replies anything else

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :pre_importing, project: Project.last)
  2. Let's make it stale
    image.update!(migration_pre_import_started_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "import_aborted"

3️⃣ Importing image and container registry replies import_in_progress

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :importing, project: Project.last)
  2. Let's make it stale
    image.update!(migration_import_started_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "importing"

4️⃣ Importing image and the container registry replies anything else

  1. Let's create a container repository in the right migration status
    image = FactoryBot.create(:container_repository, :importing, project: Project.last)
  2. Let's make it stale
    image.update!(migration_import_started_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "import_aborted"

5️⃣ Any other case

  1. Let's create a container repository in any other migration status
    image = FactoryBot.create(:container_repository, :pre_import_done, project: Project.last)
  2. Let's make it stale
    image.update!(migration_pre_import_done_at: 1.hour.ago)
  3. Run the guard job
    ContainerRegistry::Migration::GuardWorker.new.perform
  4. Check the status
    image.reload.migration_state # "import_aborted"

🏁 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
