Improve Sidekiq process hard shutdown (Cluster mode)
What does this MR do?
An edge case was caught while running `sidekiq-cluster` in GDK (through runit), reported in gitlab-development-kit#844 (moved):
When making a code change in a worker and then restarting the `rails-background-jobs` process (`gdk restart rails-background-jobs`), it's possible that:
- The Cluster process receives a `KILL` from `runit` (because it timed out).
- Because of the change made in the worker, the child Sidekiq process can get stuck, so the `TERM` signal sent by the monitoring thread doesn't take effect.
- As a final result, we end up with a (stuck) child Sidekiq process without a parent Cluster process.
- At this point, `gdk stop` has no effect, as it can't reach this process.
- `gdk start` fails to boot `rails-background-jobs` (`ps aux | grep runsv` shows `unable to lock supervise/lock: temporary failure`).

So it can quickly become a messy state to be in.
One local workaround is to run `pkill -9 -f 'runsv'` (to kill the stuck `runsv`) and then also kill the Sidekiq process.
This MR makes the necessary changes to `KILL` the orphaned Sidekiq process if the `TERM` signal doesn't stop it within 5 seconds.
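
The TERM-then-KILL escalation described above can be sketched roughly as follows. This is only an illustration, not the code from this MR: the helper names `process_alive?` and `hard_stop`, the polling interval, and the return values are all assumptions made for the example.

```ruby
# Hypothetical sketch; names and return values are illustrative only.

# Returns true if a process with the given pid still exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Sends TERM, waits up to `timeout` seconds, then escalates to KILL.
def hard_stop(pid, timeout: 5, interval: 0.1)
  Process.kill(:TERM, pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  while process_alive?(pid)
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      Process.kill(:KILL, pid) # TERM was not honoured in time
      return :killed
    end
    sleep(interval)
  end

  :terminated
rescue Errno::ESRCH
  # The process disappeared before a signal could be delivered.
  :terminated
end
```

Calling `hard_stop(pid)` on a stuck process would return `:killed` after roughly 5 seconds, while a process that honours `TERM` yields `:terminated`. Note that if the target is a child of the calling process, it has to be reaped (for example with `Process.detach`), otherwise the existence check keeps seeing the zombie.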
Edited by Oswaldo Ferreira