Improve Sidekiq process hard shutdown (Cluster mode)

What does this MR do?

An edge case was caught while running sidekiq-cluster in GDK (through runit), reported in gitlab-development-kit#844 (moved):

After making a code change in a worker and then restarting the rails-background-jobs process (gdk restart rails-background-jobs), it's possible that:

  1. The Cluster process receives a KILL from runit (because it timed out)
  2. Because of the change made in the worker, the child Sidekiq process can get stuck, so the TERM signal sent from the monitoring thread doesn't take effect
  3. As a final result, we end up with a (stuck) child Sidekiq process without a parent Cluster
  4. At this point gdk stop has no effect, as it can't reach this process
  5. gdk start fails to boot rails-background-jobs (ps aux | grep runsv shows unable to lock supervise/lock: temporary failure), so things can quickly end up in a messy state.

One local workaround is to run pkill -9 -f 'runsv' (to kill the stuck runsv) and then also kill the Sidekiq process.

This MR makes the necessary changes to KILL the orphaned Sidekiq process if it has not exited within 5 seconds of receiving TERM.
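The TERM-then-KILL escalation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the helper names `process_alive?` and `terminate_with_deadline`, the polling interval, and the return symbols are all assumptions made for the example. Only the 5-second default comes from the description.

```ruby
# Hypothetical helper (not from the MR): checks whether a pid still exists.
# Signal 0 performs the existence/permission check without delivering a signal.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  # The process exists but belongs to another user.
  true
end

# Hypothetical helper (not from the MR): sends TERM, polls until the process
# exits, and escalates to KILL once the deadline has passed.
def terminate_with_deadline(pid, timeout: 5, interval: 0.1)
  Process.kill(:TERM, pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  while process_alive?(pid)
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      # TERM did not work in time, so force the process down.
      Process.kill(:KILL, pid)
      return :killed
    end
    sleep(interval)
  end

  :terminated
rescue Errno::ESRCH
  # The process disappeared before a signal could be delivered.
  :gone
end
```

Note that a child that has exited but has not yet been reaped by its parent (a zombie) still answers signal 0, so a caller that is the direct parent needs to reap it, for example with `Process.detach(pid)`. An orphaned process, as in the scenario above, is reaped by init instead.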

Edited by Oswaldo Ferreira
