Fix UpdateAllMirrorsWorker job tracker usage

What does this MR do and why?

For #340630 (closed)

While rolling out one of the flags (#351420 (closed)) that enables UpdateAllMirrorsWorker to use the job tracker instead of the queue size, we noticed a significant RPS drop. Looking back at the implementation (!79097 (diffs)), the semantic change doesn't work as expected. That MR converts UpdateAllMirrorsWorker to depend on the count of processing jobs instead of the number of scheduled jobs in the queue. In theory, those two numbers should be very similar, with at most a small lag. However, when running the following comparison on a Production console, the difference is huge: the job tracker count never drops to 0, so the worker always sleeps until it times out.
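A minimal version of that comparison (essentially the same loop as in the validation steps below; the Production output itself is not reproduced here):

job_tracker = LimitedCapacity::JobTracker.new(ProjectImportScheduleWorker.name)

# Print both numbers side by side to see how far they drift apart.
10.times do
  puts "job_tracker: #{job_tracker.count}"
  puts "queue_size: #{ProjectImportScheduleWorker.queue_size}"
  sleep 0.5
end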

There are a few reasons for this:

  • The queue size is always low because a new job is likely to be picked up immediately by our Sidekiq cluster, while the number of processing jobs is always high because the RPS of the related ProjectImportScheduleWorker is high (200 ops/s).
  • In addition, the job tracker needs a manual maintenance task: calling LimitedCapacity::JobTracker#clean_up, which removes dead, killed, or otherwise unexpectedly terminated jobs from the tracked list. Unfortunately, we called this clean-up outside of the sleep loop, and because the worker is also de-duplicated, the clean-up ended up being called only rarely (see the sketch after this list).
  • The job tracker uses GitLab::Mirror.available_capacity to limit the number of JIDs that can be tracked. Unfortunately, this number is fluid: the available capacity goes up and down all the time, which may affect the tracking functionality.
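As a rough sketch of the clean-up problem (an assumed shape for the wait loop; this is not the actual UpdateAllMirrorsWorker source), the clean-up ran once per de-duplicated worker execution rather than on every check:

# Simplified sketch of the pre-fix wait behaviour (assumption, not real worker code).
def wait_for_tracked_jobs(job_tracker, timeout:)
  job_tracker.clean_up # called only once, outside the loop, per de-duplicated run

  deadline = Time.current + timeout
  # Stale JIDs from dead/killed jobs keep inflating the count, so the condition
  # rarely becomes true and the worker usually sleeps until the timeout.
  sleep 1 until job_tracker.count.zero? || Time.current > deadline
end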

In this MR, I'm trying to fix the issue by:

  • Tracking the number of scheduled ProjectImportScheduleWorker jobs instead of processing jobs. To accomplish this, I moved the JobTracker#register call into UpdateAllMirrorsWorker, so JIDs are registered when a batch of jobs is scheduled. When a job starts processing, it removes its JID from the tracked list. This brings the tracked number very close to the actual queue size; the remaining difference is the gap between a job being removed from the queue and the moment it starts, which amounts to Redis networking round trips plus Sidekiq middleware latency (see the sketch after this list).
  • Cleaning up the job tracker right before counting, to ensure the number is always up to date. In the previous MR it was tidied only once per executed UpdateAllMirrorsWorker job; in practice that turned out not to be enough.
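A minimal sketch of the new flow (the helper names, the capacity argument to register, and the remove call are assumptions for illustration, not the exact diff):

# Sketch only: assumed names, not the exact GitLab API.
def tracker
  LimitedCapacity::JobTracker.new(ProjectImportScheduleWorker.name)
end

# 1. In UpdateAllMirrorsWorker: register JIDs at scheduling time, so the
#    tracked count mirrors scheduled (not in-flight) jobs.
def track_scheduled_batch(jids)
  # Capacity argument assumed, based on GitLab::Mirror.available_capacity being the limit.
  jids.each { |jid| tracker.register(jid, Gitlab::Mirror.available_capacity) }
end

# 2. In ProjectImportScheduleWorker#perform: untrack the JID as soon as the
#    job starts processing.
def start_processing(jid)
  tracker.remove(jid)
end

# 3. In the wait loop: clean up right before counting, so dead/killed jobs
#    can no longer inflate the number.
def remaining_scheduled_jobs
  tracker.clean_up
  tracker.count
end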

Screenshots or screen recordings

N/A

How to set up and validate locally

  • Start one Rails console session (Console A):
job_tracker = LimitedCapacity::JobTracker.new(ProjectImportScheduleWorker.name)
loop do
  puts "job_tracker: #{job_tracker.count}"
  puts "queue_size: #{ProjectImportScheduleWorker.queue_size}"
  sleep 0.5
end
  • Start another Rails console session (Console B). Run the following command to start an UpdateAllMirrorsWorker:
UpdateAllMirrorsWorker.perform_async
  • Console A prints out lines like the following, which indicate that job_tracker.count and the actual queue size stay very close, a few seconds apart. Looking at the logs, the UpdateAllMirrorsWorker job is blocked until the number drops to 0.
job_tracker: 19
queue_size: 19
job_tracker: 19  <=== Lagged behind
queue_size: 12
job_tracker: 19
queue_size: 12
job_tracker: 18
queue_size: 12
job_tracker: 14 
queue_size: 12 <=== In sync, 2 seconds later
job_tracker: 12
queue_size: 12
job_tracker: 12
queue_size: 10
job_tracker: 10
queue_size: 9
job_tracker: 9 <=== Lagged behind
queue_size: 5
job_tracker: 8
queue_size: 3
job_tracker: 5 
queue_size: 3
job_tracker: 3 <=== In sync, 1.5 seconds later
queue_size: 3
job_tracker: 3
queue_size: 3
job_tracker: 3
queue_size: 3
job_tracker: 2
queue_size: 2
job_tracker: 2
queue_size: 2
job_tracker: 1
queue_size: 1
job_tracker: 1
queue_size: 1
job_tracker: 1
queue_size: 0
job_tracker: 0
job_tracker: 19 <==== At this point, UpdateAllMirrorsWorker is rescheduled
queue_size: 19
...
  • Look at the Sidekiq admin dashboard and the logs to confirm that UpdateAllMirrorsWorker is rescheduled.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
