Sidekiq Cluster (child) processes can be left behind if the parent is SIGKILL'd

After !1128 (merged) was merged, we've seen a few reports of timeouts when trying to start or restart GDK after running gdk update.

The main change after this gdk update is that bin/background_jobs now runs Sidekiq Cluster instead of sidekiq. This is how the process tree looks for Sidekiq Cluster running a single Sidekiq process:

› pstree 11170
-+= 11170 osw ruby bin/sidekiq-cluster * -P /Users/osw/projects/gitlab/gdk-ee/gitlab/tmp/pids/sidekiq-cluster.pid -e development
 \--= 11178 osw /Users/osw/.rbenv/versions/2.6.5/bin/sidekiq -c50 -edevelopment -t25 -gqueues:auto_devops:auto_devops_disable,auto_merge:auto_merge_p ...

And this is the process relationship between them:

› ps j 11170
USER   PID  PPID  PGID   SESS JOBC STAT   TT       TIME COMMAND
osw  11170  6028 11170      0    0 Ss     ??    0:00.27 ruby bin/sidekiq-cluster * -P /Users/osw/projects/gitlab/gdk-ee/gitlab/tmp/pids/sidekiq-cluster.pid -e development

› ps j 11178
USER   PID  PPID  PGID   SESS JOBC STAT   TT       TIME COMMAND
osw  11178 11170 11178      0    1 R      ??    0:35.16 sidekiq 5.2.7 queues:auto_devops:auto_devops_disable
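
The pgroup split above (each process leading its own process group, PGID == PID) can be reproduced in plain Ruby. This is a minimal sketch, not GDK or sidekiq-cluster code: Process.spawn with pgroup: true makes the child the leader of a new process group, so a signal sent to the parent's pgroup wouldn't reach it.

```ruby
# Sketch: spawn a child in its own process group, mirroring the
# parent/child PGID split shown in the ps output above.
parent_pgid = Process.getpgid(Process.pid)

# pgroup: true makes the child the leader of a brand-new process group,
# i.e. the child's PGID equals its own PID.
child_pid  = Process.spawn("sleep 60", pgroup: true)
child_pgid = Process.getpgid(child_pid)

puts "parent pgid: #{parent_pgid}, child pgid: #{child_pgid}"

# Clean up: signal the child directly (not via pgroup) and reap it.
Process.kill("TERM", child_pid)
Process.wait(child_pid)
```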

I wasn't able to reproduce this consistently. One successful attempt was:

  1. Changing the GDK Procfile (removing the SIDEKIQ_WORKERS=1 env var) with all processes running
  2. Running gdk stop, which failed to stop rails-background-jobs
  3. Checking ps aux | grep runsv, where I saw fatal: unable to lock supervise/lock
  4. Force killing runsv, which fixed the issue

One detail raised by @jacobvosmaer-gitlab is that the processes don't share the same pgroup, and runit sends signals to pgroups (e.g. runit's control/t sends TERM to -11170; the - prefix means the signal targets the whole pgroup). That's not a real problem for the TERM signal, as Sidekiq Cluster traps it and gracefully terminates the child processes. But if that times out, runit sends a KILL after a few seconds (7 by default).
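
The trap-and-forward part can be illustrated with a small sketch (again, not the actual sidekiq-cluster implementation): a parent trapping TERM and relaying it to children that live in other pgroups, so a TERM delivered only to the parent still brings the whole tree down.

```ruby
# Sketch: parent traps TERM and forwards it to children in other pgroups.
child_pids = 2.times.map { Process.spawn("sleep 60", pgroup: true) }

Signal.trap("TERM") do
  # Relay the signal to every child; they are in separate pgroups,
  # so the original group-wide signal would not have reached them.
  child_pids.each { |pid| Process.kill("TERM", pid) }
end

Process.kill("TERM", Process.pid)    # simulate runit's TERM to the parent

# Reap the children and collect their exit statuses.
statuses = child_pids.map { |pid| Process.wait2(pid).last }
puts statuses.map(&:termsig).inspect
```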

The KILL signal cannot be trapped and can probably kill just the parent (Sidekiq Cluster), leaving the children behind (http://morningcoffee.io/killing-a-process-and-all-of-its-descendants.html). I couldn't reproduce this locally, though (SIGKILLing sidekiq-cluster also kills the children).
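
The suspected failure mode can be simulated directly (a hypothetical sketch on Unix, not GDK code): KILL runs no handlers, so when it is delivered only to the parent's pgroup, a child leading its own pgroup survives as an orphan.

```ruby
# Sketch: KILL the parent's pgroup only; the child (own pgroup) survives.
parent = fork do
  Process.setpgrp                                   # parent leads its own pgroup
  child = Process.spawn("sleep 60", pgroup: true)   # child leads another pgroup
  File.write("child.pid", child)
  sleep                                             # KILL is untrappable; no cleanup runs
end

sleep 0.5                                           # let the parent spawn + write the pid
Process.kill("KILL", -Process.getpgid(parent))      # signal only the parent's pgroup
Process.wait(parent)

orphan = File.read("child.pid").to_i
alive  = (Process.kill(0, orphan).positive? rescue false)
puts "orphan still alive: #{alive}"

Process.kill("KILL", orphan) if alive               # clean up the orphan
File.delete("child.pid")
```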

Creating this issue given that we also use runit in Omnibus, and we didn't get to the bottom of it for GDK.

I think one of the reasons we're not seeing problems in production is the extra supervision at https://gitlab.com/gitlab-org/gitlab/blob/4331091b0ba4bae0e74c5c015fe5e70f95055706/config/initializers/sidekiq_cluster.rb#L9-18, which monitors and kills children that have lost their master process.
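
The rough shape of that safety net, as I understand it (see the linked initializer for the real code; the names below are mine): each child periodically checks its parent PID, and if the parent disappears (the child gets re-parented, e.g. to PID 1), it terminates itself instead of lingering as an orphan.

```ruby
# Hypothetical helper: has our original parent gone away?
# When the parent dies, the kernel re-parents us and Process.ppid changes.
def parent_gone?(original_ppid)
  Process.ppid != original_ppid
end

# Hypothetical watchdog: a background thread that TERMs the current
# process once its original parent disappears.
def watch_parent(interval: 1)
  original_ppid = Process.ppid

  Thread.new do
    loop do
      sleep(interval)
      if parent_gone?(original_ppid)
        Process.kill("TERM", Process.pid)
        break
      end
    end
  end
end
```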

Edited by Oswaldo Ferreira