Fix Guard worker thresholds

David Fernandez requested to merge 362474-fix-guard-worker-thresholds into master

🍋 Context

We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the Rails backend. For all the nitty-gritty details, see &7316 (comment 897867569).

The migration has essentially 2 main steps: pre_import and import. Given the restrictions applied when an image repository is in those states, we implemented a worker whose main function is to detect stale migrations and abort them.

We extended that function to also detect long running migrations. Basically, we could have situations where the Container Registry is pre_importing for way too long. In those cases, we need to detect those long running migrations and "actively" cancel them. This means that Rails will explicitly tell the Container Registry: "hey, you're running this migration for way too long. Stop it now."

To detect those long running migrations, we use thresholds. Originally, we used 10.minutes (e.g. a migration running for 10+ minutes would be considered a long running one). As we proceeded through the backlog of image repositories to migrate, we started migrating images with a large number of tags (200+). We noticed that those images had quite a long pre_import step (as expected). It was so long that we started to hit the Guard threshold.

To fix this, in !87324 (merged) we changed the threshold from a hard-coded value to 2 application settings: one for the pre_import threshold and one for the import threshold. This way, we can have 2 different values and update them as we move along the migration backlog.

Unfortunately, a small bug was introduced. The values from the application settings are Integers and we try to call #ago on them. What is missing is a call to #seconds before the #ago. Guess what happened? Yes, 💥

NoMethodError
undefined method `ago' for 1800:Integer

This MR fixes that, which is exactly what issue #362474 (closed) describes.
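
As a rough illustration of the threshold check, here is a plain-Ruby sketch that does not load ActiveSupport. The setting names, the `600` value, and the helper methods are hypothetical and are not taken from the actual worker; only the `1800` value comes from the error message above. In Rails, `1800.seconds.ago` gives the same cutoff as subtracting 1800 seconds from the current time, which is what the sketch does by hand.

```ruby
# Hypothetical stand-in for the application settings: plain Integers, in seconds.
SETTINGS = {
  pre_import_max_seconds: 1800, # value seen in the error message
  import_max_seconds: 600       # made-up value for illustration
}.freeze

# Cutoff time for a step: a migration that started before it is "long running".
# With ActiveSupport this would read `threshold.seconds.ago`.
def cutoff_for(step, now: Time.now)
  now - SETTINGS.fetch(:"#{step}_max_seconds")
end

def long_running?(step, started_at, now: Time.now)
  started_at < cutoff_for(step, now: now)
end

now = Time.at(1_652_700_000)
puts long_running?(:pre_import, now - 1900, now: now) # started 1900s ago
puts long_running?(:pre_import, now - 1200, now: now) # started 1200s ago

# A bare Integer has no #ago, which is the failure reported above.
begin
  1800.ago
rescue NoMethodError => e
  puts e.class
end
```

The point of the sketch is only that the setting is a number of seconds and has to be turned into a duration (or subtracted from a time) before it can be compared with a start timestamp.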


While at it, we noticed another bug. The Guard worker has deduplication defined with deduplicate :until_executed. It seems that the deduplication key is left behind when the job fails:

[Screenshot: Screenshot_2022-05-16_at_11.17.59]

We're planning to open an MR to fix the deduplication logic but in the meantime, we can also set a shorter ttl on the deduplication of the guard worker.
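
To show why a shorter ttl helps, here is a minimal in-memory model of a deduplication key with an expiry. It is a sketch under assumptions, not the actual deduplication code: the `DedupStore` class, its methods, and the `"guard"` key are all hypothetical.

```ruby
# Hypothetical in-memory model of a deduplication key store with a TTL.
class DedupStore
  def initialize
    @expires_at = {}
  end

  # Returns true when the job may be enqueued, false when it is a duplicate.
  def acquire(key, ttl:, now:)
    expiry = @expires_at[key]
    return false if expiry && expiry > now

    @expires_at[key] = now + ttl
    true
  end

  # Called after a successful execution to drop the key early.
  def release(key)
    @expires_at.delete(key)
  end
end

store = DedupStore.new
t0 = Time.at(0)
ttl = 5 * 60 # 5 minutes, expressed in seconds

store.acquire("guard", ttl: ttl, now: t0) # first enqueue succeeds
# Assume the job fails here: release is never called, so the key stays behind.
puts store.acquire("guard", ttl: ttl, now: t0 + 60)  # still within the ttl
puts store.acquire("guard", ttl: ttl, now: t0 + 301) # ttl has elapsed
```

In this model, a key left behind by a failed job blocks new jobs only until the ttl expires, so a shorter ttl bounds how long the worker stays blocked.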

🔬 What does this MR do and why?

  • Update the Guard worker to convert application settings values to seconds while reading thresholds.
  • Update the Guard worker to have a deduplication ttl of 5.minutes.
  • Update the related spec.
    • Stub application settings with Integer values and not ActiveSupport::Duration.

🖼 Screenshots or screen recordings

n/a

How to set up and validate locally

See !80502 (merged)

🚦 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez