Sidekiq-cluster usage on sidekiq nodes
I just noticed that sidekiq-cluster is being used extensively on the sidekiq nodes. There are currently 14 queues in the sidekiq-cluster list, up from only 5 when we moved the fleet just two months ago. This is a big problem: sidekiq-cluster spawns more worker processes for each queue, and the nodes are seriously struggling with contention.
Back then we sized the new VMs based on the Sidekiq configuration, and we agreed that 4 cores were enough to sustain the load. This worked as planned until we began adding more and more queues to sidekiq-cluster, to the point where the load graphs for the fleet are an ugly mess of spikes and, unsurprisingly, we're seeing an increase in reports of background-processing slowness.
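For context, here is a minimal sketch of how sidekiq-cluster maps its arguments to processes (queue names are hypothetical); as far as I understand, every space-separated argument spawns its own Sidekiq process, while comma-separated queues share one:

```shell
# Hypothetical queue names: each space-separated argument becomes a
# separate Sidekiq process; comma-separated queues share one process.
sidekiq-cluster pipeline_processing mailers,emails_on_push project_imports
# => 3 Sidekiq processes on this node. With 14 single-queue entries,
#    that's 14 processes competing for the same 4 cores.
```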
We need to fix this, and in my opinion there are two options:
- We scale up the sidekiq nodes. This would be the quickest fix but also the most expensive, and it won't isolate failure domains: if one queue goes crazy, it will still take everything else down with it.
- We split the queues across multiple fleets. This would require more work on the infrastructure side, but it would give us more stability than the previous option. We would also need to come up with sensible groupings of queues; see the sketch below.
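To make the second option concrete, here is a sketch with hypothetical groupings; the idea is that each fleet runs its own sidekiq-cluster over a disjoint subset of queues, so a runaway queue can only saturate its own fleet:

```shell
# Fleet A: CPU-heavy, throughput-oriented work
sidekiq-cluster pipeline_processing,project_imports

# Fleet B: latency-sensitive, user-facing jobs
sidekiq-cluster mailers,emails_on_push new_note
```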
My question still remains, though, and I'd appreciate it if someone could give a detailed answer: why do we need to rely on a secondary Sidekiq instance to deal with our queues? What can we improve so that Sidekiq can process all of our queues without special-casing any of them? And if we have reached its limits, does it make sense to look for a replacement?
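For comparison, a single vanilla Sidekiq process can already drain several queues on its own, with optional weights to bias which queue gets checked first (queue names and weights below are illustrative):

```shell
# One process, 25 worker threads, three weighted queues: 'urgent' is
# checked roughly four times as often as 'low'.
sidekiq -c 25 -q urgent,4 -q default,2 -q low,1
```

Weights don't isolate failure domains, though, which may be the real reason we ended up on sidekiq-cluster in the first place.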
/cc @gl-infra