Address several edge cases in ProcessSupervisor
This MR addresses several bugs and warts in ProcessSupervisor
that I found during the dev
incident we had yesterday. This class runs as a daemon thread in sidekiq-cluster and the Puma primary, and it can inform the caller of changes in process liveness in clustered environments.
Issues I fixed:
- Allow calling
shutdown
from supervisor callback; this is what we want for Sidekiq, where if one worker goes away, we want all workers to go away. This requires a small change to how we keep track of thealive
state. - Swap loop-check with sleep call, since sleep suspends
the calling thread, which may lead to the callback
being invoked if
alive
is invalidated by another thread. This also means we don't have to check it anymore in the callbacks. - Do not trap INT and TERM by default; this was swallowing these signals for the Puma master.
- De-duplicate PIDs received from the callback; if only some processes died, then sidekiq-cluster kept passing in the same set of worker IDs back to the supervisor, so they ended up being repeated in the internal list. That doesn't really do any harm, but it can lead to signals being forwarded to processes more than once.
None of these issues are critical bugs, but all are worth fixing IMO.
Related to #361429 (closed)
Edited by Matthias Käppler