Update sidekiq guides to restrict use of sidekiq APIs
Background
Scalability team currently maintains https://docs.gitlab.com/ee/development/sidekiq/#sidekiq-guides which outline several usage guidelines for developers working with sidekiq-related code. See #1813 (comment 1108277496).
In the process of preparing gitlab application for sidekiq zonal clusters #973, we encountered blockers such as:
- There were instances of worker logic depending on sidekiq-queue size to rate-limit itself( see #1405 (comment 739859655)).
- This is resolved in gitlab-org/gitlab#373648 (closed) and the
queue_size
method removed in gitlab-org/gitlab!98388 (merged).
- This is resolved in gitlab-org/gitlab#373648 (closed) and the
- Application using redis-sidekiq via
Sidekiq.redis
for non-queue-related operations. SidekiqStatus and DuplicateJobs both wrote into redis-sidekiq.
Nonetheless, it raised the question of whether we should have guidelines on using Sidekiq APIs and interacting with Sidekiq.redis
.
Proposal
To restrict the use of sidekiq APIs and Sidekiq.redis
in queue producers (rails app or processes that enqueue new jobs) and consumers (processes that fetch jobs from sidekiq redis). Such usage should be allowed only in the sidekiq middleware layer, migrations and admin operations.
Rubocop rules can be added to enforce this guideline:
- Refrain from using
Sidekiq.redis
except for migration and administration operations where fine-grain control is required. - Same as point 1 but for Sidekiq APIs.
- Same as point 1 but for
Gitlab::Redis::Queues
module
Impacts
"Preserves" primary CPU headroom for redis-sidekiq (immediate)
By keeping only the absolutely necessary operations in redis-sidekiq, we maximise CPU headroom for redis-sidekiq. Everything else (sidekiq status, duplicate jobs, etc) can be performed in other redis shards which can be horizontally scaled with Redis Cluster (note that Redis Cluster is not recommended for sidekiq -- https://github.com/mperham/sidekiq/wiki/Using-Redis#architecture).
For example, migrating DuplicateJobs
will provide substantial Redis primary CPU headroom (&423 (comment 1113572654)).
Greater accuracy in forecast/modelling growth (immediate)
This is a smaller point that is not thoroughly validated yet. But the idea is: if sidekiq-redis only handle queueing operations, there should be a stronger linear relationship between throughput with CPU utilisation %.
The mental model for reasoning around sidekiq would be clearer:
- Queueing-related and sidekiq (retrysets, crons etc) components: redis-sidekiq
- Non-queueing-related components such as gitlab custom logic: redis- (xxx as we have not decided where it should go yet)
Conversely, if only a small subset of processes used Sidekiq APIs or Sidekiq.redis
, this relationship (throughput vs CPU) is harder to model accurately.
Ease of horizontally scaling sidekiq (in future)
Doing so keeps application logic clean and decouples the business logic in the application from the queue system. i.e. when developers implement worker logic, they only implement perform
and set worker attributes which triggers different behaviours in relevant sidekiq middlewares. Concerns such as payload compression, deduplication, retries can be abstracted and handled via SidekiqMiddleware
which we do extensively.
Horizontally scaling the queue system should be easier as a result since changes will be primarily localised to the ApplicationWorker
and relevant sidekiq middlewares.
This is an iffy point since one could say YAGNI (w.r.t horizontally scaling sidekiq). But having such guidelines, keep options open and if we were to revisit &423, the amount of effort in getting the application layer ready should be considerably lower.