Enqueuer job: fix the re enqueue (!83527) · Merge requests · GitLab.org / GitLab

David Fernandez requested to merge 356433-fix-re-enqueue-in-enqueuer into master Mar 23, 2022

💈 Context

We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend.

At the core of the Rails part lies the Enqueuer worker. Its responsibility is to find the next eligible image repository to migrate and call the container registry to start or retry the migration.

To make things 🚀, we implemented the concept of capacity. Capacity works like a number of slots that ongoing migrations can take. For example, let's say we have a capacity of 10. The Enqueuer then has to start the migration on 10 image repositories.

(A) How do we achieve that? Simply by re-enqueuing a job at the end of the Enqueuer #perform if the current load and the current capacity allow it. Using our example again, the Enqueuer will "chain" 9 executions after the first one.
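The chaining described in (A) can be sketched with a small, self-contained Ruby model. This is only an illustration under stated assumptions: `ToyEnqueuer`, its in-memory queue, and the `pending`/`started` lists are hypothetical stand-ins, not the real worker or the Sidekiq API.

```ruby
# Toy model of the "chain" behaviour. All names here are hypothetical.
class ToyEnqueuer
  attr_reader :started

  def initialize(capacity:, pending:)
    @capacity = capacity    # maximum number of ongoing migrations
    @pending = pending.dup  # repositories waiting to be migrated
    @started = []           # repositories whose migration is ongoing
    @queue = []             # stand-in for the background job queue
  end

  def perform_async
    @queue << :enqueuer_job
  end

  # Runs queued jobs one by one until the queue is empty.
  def drain
    perform while @queue.shift
  end

  private

  def perform
    return unless below_capacity?

    repository = @pending.shift
    return unless repository

    @started << repository            # stands in for the call to the registry
    perform_async if below_capacity?  # re-enqueue while slots remain
  end

  def below_capacity?
    @started.size < @capacity
  end
end

enqueuer = ToyEnqueuer.new(capacity: 10, pending: (1..25).map { |i| "repo-#{i}" })
enqueuer.perform_async
enqueuer.drain
puts enqueuer.started.size # 10: one initial job plus 9 chained executions
```

A single initial job is enough to fill all the slots in this model, because each execution schedules the next one until the capacity is reached.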

(B) In !83091 (merged), we extended the deduplication with until_executed. The reason was that we noticed on staging that multiple jobs could interleave their executions (see #356130 (closed)), and that's not something we want.

Now combine (B) with (A) and what happens? The re_enqueue is executed, but because until_executed deduplication is in place, the job is still considered a duplicate while #perform is running, so that re_enqueue is immediately rejected. As a result, the Enqueuer never fills the ongoing migrations up to capacity: it's as if we had a forced capacity of 1. This is issue #356433 (closed).
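To make the interaction concrete, here is a minimal simulation of the two deduplication strategies. It is a sketch, not GitLab's implementation: `ToyDedupQueue` and `migrations_started` are hypothetical names, and the only behaviour modelled is when the deduplication key is released (before the job runs for until_executing, after it finishes for until_executed).

```ruby
# Toy queue that drops a job when an identical one is still "locked".
class ToyDedupQueue
  def initialize(strategy)
    @strategy = strategy # :until_executed or :until_executing
    @jobs = []
    @dedup_key_taken = false
  end

  # Returns true when the job is accepted, false when dropped as a duplicate.
  def perform_async(&job)
    return false if @dedup_key_taken

    @dedup_key_taken = true
    @jobs << job
    true
  end

  def drain
    while (job = @jobs.shift)
      # until_executing: the key is released as soon as the job starts.
      @dedup_key_taken = false if @strategy == :until_executing
      job.call
      # until_executed: the key is released only once the job has finished.
      @dedup_key_taken = false if @strategy == :until_executed
    end
  end
end

# Counts how many migrations a chain of re-enqueued jobs manages to start.
def migrations_started(strategy, capacity: 10)
  queue = ToyDedupQueue.new(strategy)
  started = 0
  enqueuer = lambda do
    started += 1
    # Re-enqueue at the end of the job while slots remain.
    queue.perform_async(&enqueuer) if started < capacity
  end
  queue.perform_async(&enqueuer)
  queue.drain
  started
end

puts migrations_started(:until_executed)  # 1
puts migrations_started(:until_executing) # 10
```

In this model the re-enqueue issued from inside the job is dropped under until_executed, which reproduces the "forced capacity of 1" symptom, while until_executing lets the chain continue.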

This MR tries to fix the situation by:

- Using until_executing for the deduplication so that the re_enqueue happening at the end of #perform is successful.
- Using an exclusive lease so that we guarantee that two parallel jobs can't run together. One of them will simply end with a no-op.
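The exclusive lease idea can be illustrated with a small in-memory model. This is a hedged sketch: `ToyLease` and `ToyEnqueuerWorker` are hypothetical names, the lease lives in process memory rather than in a shared store, and it has no timeout, unlike what a production lease would need.

```ruby
require 'securerandom'

# In-memory stand-in for an exclusive lease (assumption: no expiry).
class ToyLease
  @holders = {}
  @mutex = Mutex.new

  class << self
    # Returns a token when the lease is free, nil when it is already taken.
    def try_obtain(key)
      @mutex.synchronize do
        next nil if @holders.key?(key)

        @holders[key] = SecureRandom.uuid
      end
    end

    # Only the holder of the matching token can release the lease.
    def release(key, token)
      @mutex.synchronize { @holders.delete(key) if @holders[key] == token }
    end
  end
end

class ToyEnqueuerWorker
  LEASE_KEY = 'toy_enqueuer_worker'

  # Returns :done when the work ran, :noop when another job holds the lease.
  def perform
    ran = try_obtain_lease { yield if block_given? }
    ran ? :done : :noop
  end

  private

  def try_obtain_lease
    token = ToyLease.try_obtain(LEASE_KEY)
    return false unless token

    begin
      yield
      true
    ensure
      ToyLease.release(LEASE_KEY, token)
    end
  end
end

inner = nil
# While the first job holds the lease, a second job tries to run.
outer = ToyEnqueuerWorker.new.perform { inner = ToyEnqueuerWorker.new.perform }
puts outer # done
puts inner # noop
```

The second job returns without doing any work, and once the first job releases the lease a later job can run normally.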

🔬 What does this MR do and why?

- Use until_executing deduplication.
- Use an exclusive lease in the #perform method.
- Update the related spec.

🖼 Screenshots or screen recordings

n / a

📸 How to set up and validate locally

We can't really validate the until_executing deduplication but we can check the exclusive lease usage.

Update the #perform method to:

```ruby
def perform
  try_obtain_lease do
    # Hold the lease for 5 minutes so that a second job can be enqueued meanwhile.
    sleep 60 * 5
  end
end
```

Update the background logs script (Procfile in GDK) to have SIDEKIQ_WORKERS=2 and make sure that when you start your background workers, you get:
```
Starting cluster with 2 processes
```
- This is important because the first process will handle the first job and sleep for 5 minutes, so a second process is needed to pick up the second job.
Tail the background jobs logs:
```
$ gdk tail rails-background-jobs
```

In a Rails console, enqueue the first job:

```
ContainerRegistry::Migration::EnqueuerWorker.perform_async
=> "ae41adf62044c8a8456db633"
```

Wait for the start message:

```
{"severity":"INFO","time":"2022-03-23T15:35:12.234Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"ae41adf62044c8a8456db633","job_status":"start"}
```

Enqueue the second job:

```
ContainerRegistry::Migration::EnqueuerWorker.perform_async
=> "f2758b4400237162ab8bac75"
```

The second job ends immediately because the lease is already taken:

```
{"severity":"INFO","time":"2022-03-23T15:36:35.608Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"f2758b4400237162ab8bac75","job_status":"start"}
{"severity":"INFO","time":"2022-03-23T15:36:36.177Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"f2758b4400237162ab8bac75","message":"ContainerRegistry::Migration::EnqueuerWorker JID-f2758b4400237162ab8bac75: done: 0.568867 sec","job_status":"done"}
```

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Mar 24, 2022 by David Fernandez
