Enqueuer job: fix the re enqueue
💈 Context
We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend.
At the core of the rails part lies the Enqueuer worker. Its responsibility is: find the next eligible image repository to migrate and call the container registry to start/retry the migration.
Migrations are bounded by a capacity: slots that ongoing migrations can take. For example, let's say we have a capacity of 10. The Enqueuer has to start the migration on 10 image repositories.
(A) How do we achieve that? Simply by re-enqueuing a job at the end of the Enqueuer #perform if the current load and the current capacity allow it. Using our example again, the Enqueuer will "chain" 9 executions after the first one.
(B) In !83091 (merged), we extended the deduplication with until_executed. The reason was that we noticed on staging that multiple jobs could interleave their executions (see #356130 (closed)), and that's not something we want.
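The chaining described in (A) can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `ToyEnqueuer`, its queue and its counters are hypothetical stand-ins, not the actual worker or Sidekiq.

```ruby
# Hypothetical in-memory model; none of these names come from the GitLab codebase.
class ToyEnqueuer
  attr_reader :ongoing, :executions

  def initialize(capacity:)
    @capacity = capacity
    @ongoing = 0     # migrations currently in progress (the "load")
    @executions = 0  # how many times #perform ran
    @queue = []      # stand-in for the background job queue
  end

  def perform_async
    @queue << :enqueuer_job
  end

  # Run queued jobs one by one until the queue is empty.
  def drain
    perform while @queue.shift
  end

  def perform
    @executions += 1
    @ongoing += 1                          # start one migration, taking one slot
    perform_async if @ongoing < @capacity  # re-enqueue while slots remain
  end
end

enqueuer = ToyEnqueuer.new(capacity: 10)
enqueuer.perform_async
enqueuer.drain
puts "executions: #{enqueuer.executions}, ongoing: #{enqueuer.ongoing}"
```

With a capacity of 10, the first execution is followed by 9 chained ones, so the model ends with 10 executions and 10 ongoing migrations.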
Now combine (B) with (A) and what happens? Well, simple: the re_enqueue is executed but, because until_executed deduplication is in place, that re_enqueue is immediately rejected. The Enqueuer is therefore not filling ongoing migrations until reaching capacity: it's like we have a forced capacity of 1. This is issue #356433 (closed).
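To make the interaction concrete, here is a toy simulation of the two deduplication strategies. It assumes that until_executed keeps the deduplication key until the job has finished, while until_executing drops it right before the job starts. `ToyQueue` and `migrations_started` are hypothetical names for this sketch, not the real Sidekiq middleware.

```ruby
require "set"

# Hypothetical model of a deduplicating queue; not the real implementation.
class ToyQueue
  attr_reader :started

  def initialize(strategy:, capacity:)
    @strategy = strategy   # :until_executing or :until_executed
    @capacity = capacity
    @jobs = []
    @dedup_keys = Set.new
    @started = 0           # migrations started so far
  end

  # Returns false when the job is dropped as a duplicate.
  def perform_async(key = :enqueuer)
    return false if @dedup_keys.include?(key)

    @dedup_keys << key
    @jobs << key
    true
  end

  def drain
    while (key = @jobs.shift)
      # until_executing: the key is released just before the job runs.
      @dedup_keys.delete(key) if @strategy == :until_executing
      perform
      # until_executed: the key is released only once the job is done.
      @dedup_keys.delete(key) if @strategy == :until_executed
    end
  end

  private

  def perform
    @started += 1
    perform_async if @started < @capacity  # the re_enqueue at the end of #perform
  end
end

def migrations_started(strategy)
  queue = ToyQueue.new(strategy: strategy, capacity: 10)
  queue.perform_async
  queue.drain
  queue.started
end

puts migrations_started(:until_executed)   # 1
puts migrations_started(:until_executing)  # 10
```

In this model the until_executed run stops after a single migration, because the re_enqueue happens while the key is still held.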
This MR tries to fix the situation with:
- Use until_executing for the deduplication so that the re_enqueue happening at the end of the #perform is successful.
- Use an exclusive lease so that we guarantee that two parallel jobs can't run together. One of them will simply end with a no op.
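The exclusive lease part can be pictured with the following sketch. It is not the lease helper used by the worker: `ToyLease` is a hypothetical in-process stand-in built on a Mutex-protected flag, and it only shows the "second job ends with a no op" behaviour.

```ruby
# Hypothetical in-process lease; a stand-in for the real exclusive lease.
class ToyLease
  def initialize
    @held = false
    @mutex = Mutex.new
  end

  # Runs the block only if the lease is free; returns :no_op otherwise.
  def try_obtain
    obtained = @mutex.synchronize { @held ? false : (@held = true) }
    return :no_op unless obtained

    begin
      yield
    ensure
      @mutex.synchronize { @held = false }  # always release the lease
    end
  end
end

lease = ToyLease.new
started = Queue.new  # signals that the first job holds the lease
release = Queue.new  # tells the first job it may finish

first_job = Thread.new do
  lease.try_obtain do
    started << true
    release.pop      # keep the lease while the second job tries
    :migrated
  end
end

started.pop
second_result = lease.try_obtain { :migrated }  # lease is taken: no op
release << true
first_result = first_job.value

puts "first: #{first_result}, second: #{second_result}"
```

The second job does not wait for the lease. It returns immediately, which is the behaviour the validation steps below check in the logs.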
🔬 What does this MR do and why?
- Use until_executing deduplication
- Use an exclusive lease in the #perform function
- Update the related spec
🖼 Screenshots or screen recordings
n / a
📸 How to set up and validate locally
We can't really validate the until_executing deduplication but we can check the exclusive lease usage.
- Update the #perform method to:

      def perform
        try_obtain_lease do
          sleep 60 * 5
        end
      end

- Update the background logs script (Procfile in GDK) to have SIDEKIQ_WORKERS=2 and make sure that when you start your background workers, you get: Starting cluster with 2 processes
  - This is important as the first process will handle the first job and sleep for 5 minutes.
- Tail the background jobs logs:

      $ gdk tail rails-background-jobs

- In a rails console, enqueue the first job:

      ContainerRegistry::Migration::EnqueuerWorker.perform_async
      => "ae41adf62044c8a8456db633"

- Wait for the start message:

      {"severity":"INFO","time":"2022-03-23T15:35:12.234Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"ae41adf62044c8a8456db633","job_status":"start"}

- Enqueue the second job:

      ContainerRegistry::Migration::EnqueuerWorker.perform_async
      => "f2758b4400237162ab8bac75"

- The second job immediately ends because of the lease taken:

      {"severity":"INFO","time":"2022-03-23T15:36:35.608Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"f2758b4400237162ab8bac75","job_status":"start",}
      {"severity":"INFO","time":"2022-03-23T15:36:36.177Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"f2758b4400237162ab8bac75","message":"ContainerRegistry::Migration::EnqueuerWorker JID-f2758b4400237162ab8bac75: done: 0.568867 sec","job_status":"done",}
🚥 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.