
Draft: Federate sidekiq redis access by synchronising access to `Sidekiq.redis`

What does this MR do and why?

Issue: gitlab-com/gl-infra/scalability#1453

This draft MR builds on !94856 and explores the possibility of performing federated Sidekiq API operations, i.e. broadcasting Sidekiq API operations to all zonal Redis instances. An example is finding the total queue length. At present, checking a single Redis instance suffices, but a zonal Sidekiq setup requires the application to sum the length of the same queue across 3 different Redis instances (see example).

Sidekiq.redis is a singleton method that is globally shared and accessed by all Sidekiq APIs (e.g. Sidekiq::Queue, Sidekiq::ProcessSet). The public self.redis= method in the Sidekiq module allows users to switch Redis pools, which opens up the possibility of broadcasting operations across n zones. See source.
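A minimal sketch of the pool-switching idea, under stated assumptions: `StubSidekiq` is a stand-in that only mimics the shape of a globally shared pool with a setter (the real Sidekiq module is not loaded), the pools are plain hashes, and `FederatedSketch.with_pool` is a hypothetical helper, not the implementation in this MR.

```ruby
# Stand-in for the Sidekiq module: one globally shared pool plus a setter.
# Only the shape is mimicked; this is not the real Sidekiq API.
module StubSidekiq
  class << self
    attr_reader :redis_pool

    # Mimics the idea of `Sidekiq.redis=`: replace the global pool.
    def redis=(pool)
      @redis_pool = pool
    end

    # Mimics the idea of `Sidekiq.redis { |conn| ... }`.
    def redis
      yield @redis_pool
    end
  end
end

# Hypothetical helper: switch the global pool while holding a mutex.
module FederatedSketch
  LOCK = Mutex.new

  # Points the global pool at `pool` for the duration of the block and
  # restores the previous pool afterwards, even if the block raises.
  def self.with_pool(pool)
    LOCK.synchronize do
      original = StubSidekiq.redis_pool
      StubSidekiq.redis = pool
      begin
        yield
      ensure
        StubSidekiq.redis = original
      end
    end
  end
end

# Plain hashes stand in for zonal Redis pools (queue name => length).
zone_a = { "default" => 3 }
zone_b = { "default" => 5 }
StubSidekiq.redis = zone_a

length_in_b = FederatedSketch.with_pool(zone_b) do
  StubSidekiq.redis { |conn| conn.fetch("default") }
end
```

The switch is only safe if every other reader of the global pool also goes through the same mutex; a thread that reads the pool without the lock can still observe the temporarily swapped value.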

However, there are a few considerations:

  1. Performance drawbacks: other threads accessing Sidekiq.redis may be blocked until the mutex is released, even when they are otherwise ready to perform work.
  2. The onus is on developers to wrap all Sidekiq.redis or Sidekiq::XX API operations in FederatedSidekiq.lock for safe access.
  3. Possibility of deadlock (we will need to explore ways to time out and release the lock).
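One possible direction for the timeout question, as a sketch only: Ruby's `Mutex` has no timed acquire, so the hypothetical `TimedLock` below polls `Mutex#try_lock` until a deadline passes. It bounds how long a caller waits for the lock; it does not forcibly release a lock held by another thread. The class name, error class, and parameters are illustrative and not taken from this MR.

```ruby
# Hypothetical wrapper that gives up acquiring the mutex after `timeout` seconds.
class TimedLock
  class LockTimeout < StandardError; end

  def initialize
    @mutex = Mutex.new
  end

  # Polls Mutex#try_lock until it succeeds or the deadline passes.
  # Raises LockTimeout instead of blocking indefinitely.
  def synchronize(timeout:, poll_interval: 0.01)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    until @mutex.try_lock
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        raise LockTimeout, "could not acquire lock within #{timeout}s"
      end
      sleep(poll_interval)
    end

    begin
      yield
    ensure
      # Only reached after try_lock succeeded in this thread.
      @mutex.unlock
    end
  end
end

lock = TimedLock.new
result = lock.synchronize(timeout: 1) { :done }
```

Polling trades a small amount of latency and CPU for a bounded wait. Because `try_lock` returns false when the calling thread already holds the mutex, a nested acquisition times out rather than hanging.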

Note: "Safe access" means Sidekiq.redis is set to the right instance when Sidekiq APIs are invoked. "Unsafe access" refers to the possibility that an application in zone A has its Sidekiq.redis set to the Redis instance in zone B because of a data race.

Being able to safely switch Sidekiq.redis_pool between zonal Redis instances allows us to leverage Sidekiq APIs without having to cobble together ad hoc workarounds. The developer needs to do 2 things:

  1. Wrap operations within FederatedSidekiq.lock. Alternatively, place the lock block within Gitlab::Redis::Queues.with_each_pool so that the mutex is acquired per zonal operation and is not held for extended periods. This needs more investigation/validation.
  2. Handle aggregation of outputs (e.g. sum queue length, average latency, etc.).
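The two steps above could look roughly like the following sketch. Everything here is an assumption for illustration: `with_each_pool` is a local stand-in for Gitlab::Redis::Queues.with_each_pool (whose real signature is not shown in this description), `ZONE_LOCK` stands in for the mutex behind FederatedSidekiq.lock, and each pool is a plain hash of precomputed stats rather than a Redis connection.

```ruby
# Hypothetical per-zone result: what a Sidekiq::Queue lookup might report.
ZoneStats = Struct.new(:queue_length, :latency)

# Stand-in for the mutex behind FederatedSidekiq.lock.
ZONE_LOCK = Mutex.new

# Local stand-in for Gitlab::Redis::Queues.with_each_pool:
# yields each zone name together with its pool.
def with_each_pool(pools)
  pools.each { |zone, pool| yield zone, pool }
end

# Collects one result per zone. The lock is taken per zonal operation,
# so it is released between zones instead of being held for the whole loop.
def federated_stats(pools, queue)
  per_zone = {}
  with_each_pool(pools) do |zone, pool|
    per_zone[zone] = ZONE_LOCK.synchronize do
      # In the real flow, Sidekiq.redis would be switched to `pool` here
      # before calling Sidekiq APIs; this sketch reads the hash directly.
      pool.fetch(queue)
    end
  end
  per_zone
end

pools = {
  "zone-a" => { "default" => ZoneStats.new(3, 0.5) },
  "zone-b" => { "default" => ZoneStats.new(5, 1.5) },
  "zone-c" => { "default" => ZoneStats.new(4, 1.0) }
}

stats = federated_stats(pools, "default")

# Aggregation is left to the caller: sum the lengths, average the latencies.
total_length = stats.values.sum(&:queue_length)
average_latency = stats.values.sum(&:latency) / stats.size
```

Returning the per-zone hash rather than a single number keeps the aggregation choice (sum, average, or something else) with the caller, as step 2 describes.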

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

TODO: Local setup and verify. Test for concurrent access.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Sylvester Chin
