Update the GuardWorker for registry migration

Steve Abrams requested to merge 359300-guard-worker-updates into master

🌳 Context

We are in the process of migrating all container repositories on GitLab.com to the new container registry. This process is driven by Rails.

One part of the process is a Rails worker, the GuardWorker, which watches the ongoing container repository imports and takes action if they are stuck or have run into a problem.

After beginning the migration on production, we encountered some unexpected behavior that we did not see on staging. It is not uncommon for production pods to shut down. If a container repository happens to be in the middle of pre-importing or importing when this happens, it gets stuck. We experienced this twice within a few hours of monitoring. Once an import has been stuck for 30 minutes, the GuardWorker cancels it and skips that container repository. This is unfortunate because in this particular case, we could just retry the import and it would succeed. So rather than skip these long-running migrations, we would like to retry them a few times before deciding to skip them.

🔎 What does this MR do and why?

  1. We update the GuardWorker so if an import is running too long, we cancel it, but then instead of skipping it, we abort it. This will allow it to be picked back up and retried in the future.

  2. We update the timeout from 30 minutes to 10 minutes. We thought 30 minutes was a good idea to give us plenty of time to react, but it turns out that was too conservative and we do not need to stall the process that long, so we are updating to 10 minutes.

  3. Anytime an import finishes or is aborted, a new one is queued via the EnqueuerWorker. We noticed this does not happen when an import is skipped, which causes the migration process to stop, which we would like to avoid. So here, we now also re-queue when an import is skipped.

  4. We add a new migration_skipped_reason, migration_canceled_by_registry. This gives us separate skipped reasons, so we can tell whether the cancellation happened on the registry side or on the Rails side.

  5. We add a new feature flag to allow a capacity of 2. We have a set of feature flags and application settings we are using to control this migration process. Currently, we have the option to set a capacity of 0, 1, 10, or 25. After working with 1 for a while, we decided we would rather not make the jump to 10 yet, but want to see how the system behaves with more than one import running at a time. Adding a capacity of 2 will allow us to do that.
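To make the behavior in the list above concrete, here is a minimal, self-contained sketch in plain Ruby. It is not the code from this MR: `Repository`, `GuardSketch`, `enqueued_count`, the `canceller` callable, the state symbols, and the `capacity_*` flag names are all hypothetical stand-ins, and the rule that the highest enabled capacity flag wins is an assumption rather than something stated here.

```ruby
# Hypothetical sketch only; names and states are illustrative, not GitLab's.
MAX_IMPORT_DURATION = 10 * 60 # seconds; item 2 lowers this from 30 minutes

Repository = Struct.new(:path, :state, :started_at, :skipped_reason, keyword_init: true)

class GuardSketch
  attr_reader :enqueued_count

  # canceller: any callable that asks the registry to cancel an import
  def initialize(canceller:)
    @canceller = canceller
    @enqueued_count = 0
  end

  # An import counts as stuck when it has been (pre-)importing for too long.
  def stale?(repository, now)
    %i[pre_importing importing].include?(repository.state) &&
      (now - repository.started_at) > MAX_IMPORT_DURATION
  end

  # Item 1: cancel the stuck import, then abort it (instead of skipping it)
  # so it can be picked up and retried later. Returns true if it acted.
  def guard(repository, now: Time.now)
    return false unless stale?(repository, now)

    @canceller.call(repository)
    repository.state = :import_aborted
    enqueue_next
    true
  end

  # Item 3: skipping an import also queues the next one.
  def skip(repository, reason:)
    repository.state = :import_skipped
    repository.skipped_reason = reason
    enqueue_next
  end

  private

  # Stand-in for scheduling the EnqueuerWorker.
  def enqueue_next
    @enqueued_count += 1
  end
end

# Item 5 (assumed rule): the largest capacity whose flag is enabled wins.
CAPACITY_FLAGS = {
  25 => :capacity_25, 10 => :capacity_10, 2 => :capacity_2, 1 => :capacity_1
}.freeze

def capacity(enabled_flags)
  CAPACITY_FLAGS.each { |size, flag| return size if enabled_flags.include?(flag) }
  0
end
```

In this sketch an import that has been running for 11 minutes is canceled and marked aborted, one that has been running for 5 minutes is left alone, and both the abort and the skip paths bump `enqueued_count`, mirroring the re-queue described in item 3.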

While we have begun the rollout on production, the migration is expected to take multiple months, so it is all behind a feature flag. We will not fully enable the migration until we have properly tested these updates on staging and against the internal gitlab-org container repositories.

Screenshots or screen recordings

n/a

How to set up and validate locally

While minor aspects of the MR are testable locally, making a cancellation request to the registry locally is difficult: all new container repositories are currently added to the new registry automatically, so testing requires having old registry data on the newly configured registry. We are therefore testing against staging and gitlab-org data: #350919 (closed).

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related: #359300 (closed)

Edited by Steve Abrams
