
Draft: Sidekiq cluster should restart only a single Sidekiq process if that process was terminated by our memory killer

What does this MR do and why?

In #388272 (closed), we noticed that it seems wrong to terminate the whole Sidekiq cluster, along with all of its other Sidekiq processes, just because one Sidekiq process exceeded the RSS threshold and was terminated by our Memory Killer.

Although this is not a problem on SaaS, where we run a single Sidekiq process per pod, it could be a performance issue for self-managed customers who use Sidekiq Cluster to run multiple Sidekiq processes.

In this PoC, I use a shared IO.pipe so that Watchdog::Handlers::SidekiqHandler can write the pid, worker_id, and queues of the process that is about to be terminated. Once the SidekiqProcessSupervisor detects that one of the running processes is no longer alive, it reads from the shared pipe to check whether the dead pids were terminated by our Watchdog. If so, we just spawn a new Sidekiq process for the same queues and with the same worker_id.
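To illustrate the pipe handshake described above, here is a minimal sketch, not the code in this MR. The `TerminationPipe` module and its `report` and `drain` methods are hypothetical names, and the newline-delimited JSON message format is an assumption made for the example.

```ruby
require 'json'

# Hypothetical helper, for illustration only. In the MR, the writer side lives in
# Watchdog::Handlers::SidekiqHandler and the reader side in SidekiqProcessSupervisor.
module TerminationPipe
  # Writer side: record which worker is about to be terminated.
  # The message is sent as one small write, so a short line is not
  # interleaved with messages from other writers on the same pipe.
  def self.report(writer, pid:, worker_id:, queues:)
    line = JSON.generate({ pid: pid, worker_id: worker_id, queues: queues }) + "\n"
    writer.syswrite(line)
  end

  # Reader side: drain everything currently in the pipe without blocking
  # and return the reports indexed by pid.
  def self.drain(reader)
    data = +''
    begin
      loop { data << reader.read_nonblock(4096) }
    rescue IO::WaitReadable, EOFError
      # Nothing more to read right now.
    end

    data.each_line.each_with_object({}) do |line, reports|
      report = JSON.parse(line)
      reports[report['pid']] = report
    end
  end
end

# The pipe is created before the workers are spawned, so every child
# inherits the write end and the supervisor keeps the read end.
reader, writer = IO.pipe
TerminationPipe.report(writer, pid: 4242, worker_id: 'sidekiq_1', queues: %w[default mailers])
reports = TerminationPipe.drain(reader)
```

Reading with `read_nonblock` keeps the supervisor's check loop from stalling when no worker has reported anything.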

Otherwise, if the process died for an unknown reason, we restart the whole cluster as before.
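The decision between respawning a single worker and restarting the whole cluster could be sketched as follows. `RestartPlan` and `plan_restart` are hypothetical names used only for illustration, and the shape of `watchdog_reports` (a hash keyed by pid) is an assumption, not the structure used in this MR.

```ruby
# Hypothetical sketch of the supervisor's decision, for illustration only.
RestartPlan = Struct.new(:action, :workers)

# dead_pids:        pids the supervisor found to be no longer alive
# watchdog_reports: { pid => { worker_id:, queues: } } as reported by the Watchdog
def plan_restart(dead_pids, watchdog_reports)
  return RestartPlan.new(:none, []) if dead_pids.empty?

  # Any death the Watchdog did not announce has an unknown cause,
  # so fall back to the existing behaviour and restart everything.
  unexplained = dead_pids.reject { |pid| watchdog_reports.key?(pid) }
  return RestartPlan.new(:restart_cluster, []) unless unexplained.empty?

  # Every dead pid was terminated by the Watchdog: respawn only those
  # workers, reusing their worker_id and queues.
  RestartPlan.new(:respawn_workers, dead_pids.map { |pid| watchdog_reports.fetch(pid) })
end

reports = { 10 => { worker_id: 'sidekiq_0', queues: ['default'] } }
single  = plan_restart([10], reports)
mixed   = plan_restart([10, 11], reports)
```

In this sketch, a single unexplained death is enough to restart the whole cluster, which keeps the previous behaviour as the conservative default.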

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #388272 (closed)

Edited by Nikola Milojevic
