Sleep instead of scheduling Sidekiq jobs in the future (!74532) · Merge requests · GitLab.org / GitLab

Heinrich Lee Yu requested to merge sleep-instead-of-scheduling-jobs-in-the-future into master Nov 15, 2021

What does this MR do and why?

When handling Sidekiq workers that use replicas, we add a sleep in the worker when the minimum delay is not yet reached. This replaces the strategy we do now where we schedule jobs 1 second in the future.

We are testing this because Sidekiq scheduled set processing can be expensive and the latencies are high.

More details in gitlab-com/gl-infra/scalability#1380 (closed)

How to set up and validate locally

I tested locally by enabling the feature flag and comparing ExpirePipelineCacheWorker.perform_in(1.second, 1) and ExpirePipelineCacheWorker.perform_async(1). You can also disable the FF and do perform_async and it should do the same as the former one.

When scheduling in the future, we can see in the logs that duration_s is pretty fast at ~200 ms but enqueue_latency_s is ~3 seconds. This is because even though we're scheduling 1 second in the future, we'd have to wait for the time Sidekiq polls and processes the scheduled set. On GitLab.com the latency is ~20 seconds because we have so many scheduled jobs.

With the sleep strategy, there is no enqueue_latency_s because it is pushed to the queue directly. We do see duration_s increase because of the sleep but this is expected. The increase is high when the worker isn't doing anything because the job is popped immediately. But this should be lower as the worker is doing some work because there will naturally be some delay from when a job is in the queue and when it is popped. We measure this in scheduling_latency_s.

Also, when the worker has some free threads, the sleep and increase in duration should not be a problem. This will only lower throughput if all of the threads are doing work. We plan to roll this out incrementally and see how it affects job throughput.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Nov 16, 2021 by Heinrich Lee Yu

Sleep instead of scheduling Sidekiq jobs in the future

What does this MR do and why?

How to set up and validate locally

MR acceptance checklist

Merge request reports