Skip to content

Address several edge cases in ProcessSupervisor

Matthias Käppler requested to merge 361429-fix-sk-process-supervisor into master

This MR addresses several bugs and warts in ProcessSupervisor that I found during the dev incident we had yesterday. This class runs as a daemon thread in sidekiq-cluster and the Puma primary, and it can inform the caller of changes in process liveness in clustered environments.

Issues I fixed:

  • Allow calling shutdown from supervisor callback; this is what we want for Sidekiq, where if one worker goes away, we want all workers to go away. This requires a small change to how we keep track of the alive state.
  • Swap loop-check with sleep call, since sleep suspends the calling thread, which may lead to the callback being invoked if alive is invalidated by another thread. This also means we don't have to check it anymore in the callbacks.
  • Do not trap INT and TERM by default; this was swallowing these signals for the Puma master.
  • De-duplicate PIDs received from the callback; if only some processes died, then sidekiq-cluster kept passing in the same set of worker IDs back to the supervisor, so they ended up being repeated in the internal list. That doesn't really do any harm, but it can lead to signals being forwarded to processes more than once.

None of these issues are critical bugs, but all are worth fixing IMO.

Related to #361429 (closed)

Edited by Matthias Käppler

Merge request reports