Fix Guard worker thresholds
🍋 Context
We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend. For all the nitty-gritty details, see &7316 (comment 897867569).
The migration has essentially 2 main steps: `pre_import` and `import`. Given the restrictions applied when an image repository is in those states, we implemented a worker whose main function is to detect stale migrations and abort them.
We extended that function to also detect long running migrations. Basically, we could have situations where the Container Registry is `pre_importing` for way too long. In those cases, we need to detect those long running migrations and "actively" cancel them. This means that rails will explicitly tell the Container Registry: "hey, you're running this migration for way too long. Stop it now."
To detect those long running migrations, we use thresholds. Originally, we used `10.minutes` (e.g. a migration running for 10+ minutes would be considered a long running one). As we proceeded through the backlog of image repositories to migrate, we started migrating images with a large number of tags (200+). We noticed that those images had quite a long `pre_import` step (as expected). It was so long that we started to hit the Guard threshold.
To fix this, in !87324 (merged) we changed the threshold from a hard-coded value to 2 application settings: one for the `pre_import` threshold and one for the `import` threshold. This way, we can have 2 different values and update them as we move along the migration backlog.
Unfortunately, a small bug was introduced. The values from the application settings are `Integer`s and we try to call `#ago` on them. What is missing is a call to `#seconds` before the `#ago`. Guess what happened? Yes,
NoMethodError
undefined method `ago' for 1800:Integer
This MR fixes that and that's exactly issue #362474 (closed).
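To make the arithmetic behind the fix concrete, here is a minimal plain-Ruby sketch. It is not the actual Guard worker code: the constant, the method names and the `import` value are hypothetical, and the cutoff is computed by hand (`now - seconds`) so the snippet runs without ActiveSupport. In Rails, the equivalent of the cutoff would be the `#seconds` then `#ago` chain described above.

```ruby
# Hypothetical stand-ins for the two application settings. Both are plain
# Integers holding a number of seconds (1800 comes from the error message
# above; the import value is made up for illustration).
THRESHOLDS_IN_SECONDS = { pre_import: 1800, import: 300 }.freeze

# Point in time before which a migration counts as long running.
# Mirrors what converting the Integer to seconds and calling #ago yields.
def long_running_cutoff(threshold_seconds, now: Time.now)
  now - threshold_seconds
end

# True when the migration step started before the cutoff for its state.
def long_running?(state, started_at, now: Time.now)
  started_at < long_running_cutoff(THRESHOLDS_IN_SECONDS.fetch(state), now: now)
end

now = Time.at(1_000_000)

puts long_running?(:pre_import, now - (45 * 60), now: now) # true
puts long_running?(:pre_import, now - (10 * 60), now: now) # false
puts long_running?(:import, now - (10 * 60), now: now)     # true

# A bare Integer has no #ago, which is the root cause of the NoMethodError.
puts 1800.respond_to?(:ago) # false
```

Passing `now:` explicitly keeps the comparison deterministic, which is also convenient when stubbing the settings with `Integer` values in a spec.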
While at it, we noticed another bug. The Guard worker has deduplication defined with `deduplicate :until_executed`. It seems that the deduplication key is left behind when the worker fails.
We're planning to open an MR to fix the deduplication logic but in the meantime, we can also set a shorter ttl on the deduplication of the guard worker.
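As an illustration of why a shorter ttl helps, here is a small in-memory model. It is only a sketch under the assumption that the key is removed on successful execution and otherwise lingers until its ttl expires. The `DedupStore` class and its methods are hypothetical and do not reflect the real deduplication implementation.

```ruby
# Hypothetical in-memory model of an "until executed" deduplication key
# that carries a ttl.
class DedupStore
  def initialize
    @expires_at = {}
  end

  # Returns true and records a key when no live key exists; returns false
  # when a previous key is still within its ttl (the job is a duplicate).
  def try_enqueue(key, ttl_seconds, now:)
    expiry = @expires_at[key]
    return false if expiry && expiry > now

    @expires_at[key] = now + ttl_seconds
    true
  end

  # Called after a successful run. A failing job never reaches this, so its
  # key stays behind until the ttl expires.
  def release(key)
    @expires_at.delete(key)
  end
end

store = DedupStore.new
t0 = Time.at(0)
ttl = 5 * 60

puts store.try_enqueue("guard", ttl, now: t0)       # true: first job is enqueued
# The job fails here, so release is never called.
puts store.try_enqueue("guard", ttl, now: t0 + 60)  # false: stale key still blocks
puts store.try_enqueue("guard", ttl, now: t0 + 301) # true: key expired after 5 minutes
```

In this model, the ttl bounds how long a failed run can block the next one, which is why lowering it is a reasonable stopgap until the deduplication logic itself is fixed.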
🔬 What does this MR do and why?
- Update the Guard worker to convert application settings values to seconds while reading thresholds.
- Update the Guard worker to have a deduplication ttl of `5.minutes`.
- Update the related spec.
  - Stub application settings with `Integer` values and not `ActiveSupport::Duration`.
🖼 Screenshots or screen recordings
n / a
⚙ How to set up and validate locally
See !80502 (merged)
🚦 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.