
Use concurrency limiter for job deferring

Overview

Instead of using perform_at which sends the job back into a sorted set, we could consider using the concurrency limiter middleware and route the job into a buffer.

  1. Storing the job in a Redis list is computationally cheaper

By storing the job in the `schedule` sorted set, we increase the work done by the `Sidekiq::Scheduled::Enq` class, which polls the sorted set and re-enqueues due jobs. This work is largely redundant since deferred jobs get sent straight back into the sorted set. A list push (LPUSH) is also O(1), whereas a sorted-set insert (ZADD) is O(log N).

I've observed in past incidents that we often defer jobs fully until the Sidekiq and Patroni apdex recovers, before either slowly releasing the jobs back into circulation or dropping them.

  2. Separate the job buffer from redis-sidekiq

This lets us absorb excess volume in a separate datastore (redis-cluster-shared-state) without placing additional pressure on the Redis instance dedicated to Sidekiq operations.

Presently, the buffer is a single key per worker, which means that if only one worker is buffered, the additional memory lands on a single Redis node rather than being spread across the cluster.
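For illustration, a per-worker buffer key might look like the sketch below; the key format and helper name are assumptions, not the actual implementation. Because the key is derived solely from the worker name, every buffered job for that worker hashes to the same cluster slot, which is the single-node memory concern described above.

```ruby
# Hypothetical key shape: one Redis list per throttled worker class.
# All deferred jobs for a given worker land under one key, hence on
# one Redis Cluster node.
def buffer_key(worker_name)
  "sidekiq:concurrency_limit:deferred:#{worker_name}"
end

buffer_key("Chaos::SleepWorker")
# => "sidekiq:concurrency_limit:deferred:Chaos::SleepWorker"
```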

  3. Unify the flow-controlling layer of Sidekiq on the concurrency limiter.

We currently have several mechanisms to control Sidekiq job enqueues; consolidating them on the concurrency limiter would leave a single flow-control layer to reason about during incidents.

Proposal

Building on the idea proposed in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2089502716, we could make use of the concurrency limiter middleware to perform job deferment instead of skipping jobs.

We can retain the existing feature flags but modify the effects to set concurrency limits:

  • -1 when the defer feature flag is disabled (no limit imposed)
  • 0 when the defer feature flag is fully enabled (all jobs deferred)
  • (this is the tricky part) a partially enabled flag should map to a more relaxed concurrency limit; maybe 1% enabled = concurrency level of 1?

Similar to https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2384#note_1426398596
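A minimal sketch of that mapping, assuming the flag state is exposed as a rollout percentage (nil when disabled). The function name and the partial-rollout mapping are placeholders for discussion, not an agreed design:

```ruby
# Map a defer feature flag's rollout percentage to a concurrency limit.
#   nil / 0%  -> -1 (flag disabled: no limit)
#   100%      ->  0 (fully enabled: defer everything)
#   partial   -> a relaxed limit; the exact curve is the open question above.
def concurrency_limit_for(rollout_percentage)
  return -1 if rollout_percentage.nil? || rollout_percentage <= 0
  return 0 if rollout_percentage >= 100

  1 # e.g. 1% enabled == concurrency level of 1; tune per rollout step
end
```

The partial-rollout branch is deliberately simplistic; the real mapping would need agreement on how percentage steps translate to limits.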

Considerations

This means the `ConcurrencyLimit::ResumeWorker` cron job should be placed in a separate queue, ideally one not impacted by a backed-up queue; otherwise it will not be able to release jobs from the buffer back into the Sidekiq queues.
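As a sketch of that queue isolation, a weighted `sidekiq.yml` along these lines would keep the resume worker's queue polled even while the main queues are backed up (the queue name and weights are illustrative, not the actual configuration):

```yaml
# sidekiq.yml (illustrative): dedicate a queue to the resume cron worker so
# a default-queue backlog cannot starve it.
:queues:
  - [concurrency_limit_resume, 5]  # polled more often; stays shallow
  - [default, 1]
```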

Though, judging from past incidents, we typically release jobs only after the Sidekiq queues are no longer backed up.

Edited by Sylvester Chin