Guard job

Context

The context for this issue can be viewed in the epic &7316 (closed) and the breakdown of work can be viewed in this comment &7316 (comment 792633854).

Guard / Watcher

Goal

Watch how long container repositories have been in the pre_import or import state. Detect stale migrations and abort them.

It also acts as the "self heal" component. Container Registry notifications are not guaranteed to be received by Rails. As such, we can miss them, and this is something we need to be prepared for. Imagine that we miss the notification that the import step is done: the container repository would stay in read-only mode 😱.

How it is enqueued

This will be a cron worker. Suggested frequency: every 10 minutes.

The exact interval still needs to be set, but it must be shorter than container_registry_max_step_duration.
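To make the constraint concrete, here is a small sketch of the arithmetic. The method names and the 30-minute value for container_registry_max_step_duration are hypothetical, and job scheduling delays are ignored. The worst case assumes a repository crosses the staleness threshold right after a guard run, so it waits one more full interval before being examined.

```ruby
# Hypothetical helper: upper bound (in seconds) on how long a repository can
# sit in a state before the guard examines it, ignoring scheduling delays.
def worst_case_detection_delay(max_step_duration, guard_interval)
  max_step_duration + guard_interval
end

# Hypothetical helper: the guard interval must be strictly shorter than the
# maximum step duration.
def guard_interval_valid?(guard_interval, max_step_duration)
  guard_interval < max_step_duration
end

guard_interval    = 10 * 60 # suggested frequency: every 10 minutes
max_step_duration = 30 * 60 # assumed value, for illustration only

puts guard_interval_valid?(guard_interval, max_step_duration)
puts worst_case_detection_delay(max_step_duration, guard_interval)
```

With these assumed values, a stuck repository would be examined at most 40 minutes after entering its state.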

Logic

Loop through the states import, pre_import, and pre_import_done (the order is important here). Within each state, loop over every container repository that has been in that state longer than container_registry_max_step_duration. For each container repository:
1. Ping the Container Registry on migration/status
2. For error responses, execute container_repository.abort
  - if migration_retries_count equals container_registry_max_retries, execute container_repository.skip
3. For "migration step ongoing" responses, skip the container repository.
4. For "migration step successful" responses, transition the container repository to the next migration_state.
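The steps above can be sketched in plain Ruby with in-memory stand-ins. This is only an illustration under stated assumptions, not the actual worker. The Repo struct, the guard method, the registry_status callable, the symbols :error, :ongoing, :successful, the NEXT_STATE mapping, and the state names import_done, import_aborted and import_skipped are all hypothetical. Only abort, skip, migration_state, migration_retries_count and the two setting names come from the description.

```ruby
MAX_STEP_DURATION = 30 * 60 # seconds; stands in for container_registry_max_step_duration (assumed value)
MAX_RETRIES = 3             # stands in for container_registry_max_retries (assumed value)

# The order is important: import first, then pre_import, then pre_import_done.
GUARDED_STATES = %w[import pre_import pre_import_done].freeze

# Assumed "next state" for each guarded state.
NEXT_STATE = {
  'import' => 'import_done',
  'pre_import' => 'pre_import_done',
  'pre_import_done' => 'import'
}.freeze

# Minimal in-memory stand-in for a container repository record.
Repo = Struct.new(:name, :migration_state, :state_changed_at, :migration_retries_count) do
  def abort
    self.migration_state = 'import_aborted' # assumed state name
  end

  def skip
    self.migration_state = 'import_skipped' # assumed state name
  end
end

# registry_status is a callable standing in for the ping on migration/status.
# It is assumed to return :error, :ongoing or :successful.
def guard(repos, now:, registry_status:)
  GUARDED_STATES.each do |state|
    stale = repos.select do |repo|
      repo.migration_state == state && now - repo.state_changed_at > MAX_STEP_DURATION
    end

    stale.each do |repo|
      case registry_status.call(repo)
      when :error
        repo.abort
        # The description says "equals"; >= is used here as a defensive choice.
        repo.skip if repo.migration_retries_count >= MAX_RETRIES
      when :ongoing
        next # the step is still running, leave the repository alone
      when :successful
        repo.migration_state = NEXT_STATE.fetch(state)
        repo.state_changed_at = now
      end
    end
  end
end

now = Time.now
stuck = Repo.new('stuck', 'import', now - MAX_STEP_DURATION - 1, 0)
guard([stuck], now: now, registry_status: ->(_repo) { :successful })
puts stuck.migration_state
```

Resetting state_changed_at on a successful transition keeps a repository that just moved from pre_import to pre_import_done from being picked up again in the same run.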

Notes

I thought about enqueuing this job only when we need to (for example, when a container repository moves to pre_importing, enqueue the Watcher job to run after container_registry_max_step_duration). This is nice, but the risk, in my eyes, is too big. We could miss a container repository in the importing status, which is the status where we don't allow write operations.

Imagine a user saying: "I couldn't push to my container repository these last 3 days" 😱. That's why I went with a simpler solution (a cron job), which guarantees that this job will run at some point in time.

I would rather spend some backend resources (basically, the cron job will have "no op" executions) than miss a container repository in a given migration_state.

Edited Jan 07, 2022 by Tim Rizzi