Skip to content

Delete container repository worker: add registry migration support

🃏 Context

We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend. For all the nitty-gritty details, see &7316 (comment 897867569).

The migration is done in two main steps (pre import and import). Between those steps, there are some interactions between the rails backend (the driver) and the container registry (the executor).

We started the migration on small image repositories (low tags count) but as we accept larger and larger image repositories (higher tags count), we noticed that the pre import step execution on the Container Registry took more and more time to complete. For the details see https://dashboards.gitlab.net/d/registry-migration/registry-migration-detail?orgId=1&viewPanel=127&from=now-24h&to=now. We can observe pre import step that last for almost 3 hours.

The issue we're facing is that during that 3 hours, users can decide to destroy the image (along with all the tags). Now, destroying an image is not a lightweight operation. As such, it is delegated to a background job. That job could run during those long pre import steps. This causes problems on the Container Registry side: at the end of the pre import stop, it will notify rails of the pre import end and rails will simply reply 404. This leaves the image repository on the Container Registry side in an unwanted state. On top of this, the notification is not reliable by definition so that request that receives 404 could very well never happen.

The Container Registry side is discussing what to do with these image repositories. Meanwhile, we could improve the background job: if an ongoing migration is detected, don't delete the image repository and re_enqueue the job for a later time.

For the "later time" part, we're going to be smart here. In !88292 (merged), we added some pieces that estimates the pre import and the import steps execution time. We're going to re-use those pieces here to estimate when is a good time to re_enqueue the destroy job.

This is issue #361330 (closed).

What does this MR do and why?

  • Update DeleteContainerRepositoryWorker to take into account migration states.
  • Move the pieces that provide the execution time estimations so that they are easily reachable (class methods on the ContainerRegistry::Migration).
  • Update the related specs.
    • On the worker spec, take this opportunity to do some house keeping.
  • Gate this change behind a feature flag : registry_migration_guard_dynamic_pre_import_timeout.
    • From the past fixes, we had cases where the fixes behaved in an unexpected way and it forced us to pause the overall migration. We don't want this here. The feature flag acts as a switch that we can toggle off to go back to the old implementation.
    • Rollout issue: #350543

The registry migration is currently ongoing for gitlab.com only. As such, it is gated behind several feature flag.

🖥 Screenshots or screen recordings

n / a

How to set up and validate locally

  1. Have GDK ready with registry support.
  2. Enable the destroy image repository worker support:
    Feature.enable(:container_registry_migration_phase2_delete_container_repository_worker_support)
  3. Because we're going to use dummy images, we need to stub the ContainerRepository#tags_count function:
    def tags_count
      60
    end
  4. It's a bit hard to detect a re_enqueue so let's put a dummy log in DeleteContainerRepositoryWorker around L26:
    if migration.delete_container_repository_worker_support? && migrating?
      Rails.logger.debug('🔁')
      self.class.perform_in(migration_duration.from_now)
      return
    end

Let's try to run the job with an image in the default migration state (no ongoing migration):

  1. Create a dummy image:
    repo = FactoryBot.create(:container_repository, project: Project.first)
  2. Let's run the worker:
    DeleteContainerRepositoryWorker.new.perform(User.first.id, repo.id)
  3. repo is destroyed:
    repo.reload # ActiveRecord::RecordNotFound: Couldn't find ContainerRepository
  4. Logs don't show the re_enqueue.

All good 🎉

Let's try with an image that is being migrated. You can use :pre_importing, :pre_import_done or :importing.

  1. Create a dummy image being migrated:
    repo = FactoryBot.create(:container_repository, :pre_importing, project: Project.first)
  2. Let's run the worker:
    DeleteContainerRepositoryWorker.new.perform(User.first.id, repo.id)
    🔁
  3. The re enqueue is logged.
  4. The repo is not destroyed:
    repo.reload # => #<ContainerRepository:0x00007fcc5170aa60 ...>

All good 🎉

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez

Merge request reports