Improve Sidekiq process hard shutdown (Cluster mode) (!28734) · Merge requests · GitLab.org / GitLab

Oswaldo Ferreira requested to merge osw-improve-stuck-sidekiq-processes-termination into master Apr 03, 2020

What does this MR do?

An edge case was caught while running sidekiq-cluster in GDK (through runit), reported in gitlab-development-kit#844 (moved):

After making a code change in a worker and then restarting the rails-background-jobs process (gdk restart rails-background-jobs), it's possible that:

The Cluster process receives a KILL from runit (because it timed out)
Because of the change made in the worker, the child Sidekiq process can get stuck, so the TERM signal sent from the monitoring thread has no effect
As a result, we end up with a stuck child Sidekiq process without a parent Cluster process
At this point, gdk stop has no effect because it can't reach this process
gdk start fails to boot rails-background-jobs (ps aux | grep runsv shows unable to lock supervise/lock: temporary failure), so things can quickly end up in a messy state.

One local workaround is to run pkill -9 -f 'runsv' (to kill the stuck runsv) and then kill the orphaned Sidekiq process as well.

This MR makes the necessary changes to KILL the orphaned Sidekiq process if TERM fails to stop it within 5 seconds.
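
The TERM-then-KILL escalation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MR's actual diff: the helper names process_alive? and stop_with_deadline are hypothetical, and the 0.1 second polling interval is an arbitrary choice. Only the 5 second grace period comes from the description.

```ruby
# Illustrative sketch only: the helper names below are hypothetical and
# are not taken from the MR's diff.

# Returns true while a process with this pid still exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence, nothing is delivered
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Sends TERM, polls until the deadline, then escalates to KILL.
# Returns :terminated if TERM was enough, :killed if KILL was needed,
# and :gone if the process no longer existed when TERM was sent.
def stop_with_deadline(pid, timeout: 5, interval: 0.1)
  begin
    Process.kill('TERM', pid)
  rescue Errno::ESRCH
    return :gone
  end

  # A monotonic clock keeps the deadline stable if the wall clock is adjusted.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    return :terminated unless process_alive?(pid)
    sleep(interval)
  end

  # Final check before escalating, in case the process exited during the last sleep.
  return :terminated unless process_alive?(pid)

  begin
    Process.kill('KILL', pid) # KILL cannot be trapped or ignored
    :killed
  rescue Errno::ESRCH
    :terminated
  end
end
```

Note that signal 0 still succeeds for an exited child that has not been reaped yet, so a caller that is the direct parent of the process would need to wait on (or detach) the child for the liveness check to be meaningful. An orphaned process is reparented and reaped by init, so the check behaves as expected in the scenario described here.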

Edited Apr 03, 2020 by Oswaldo Ferreira
