Sidekiq Cluster (child) processes can be left behind if the parent is SIGKILL'd
After gitlab-org/gitlab-development-kit!1128 (merged) was merged we've seen a few reports of people seeing a timeout
when trying to start / restart GDK after doing a gdk update
.
The main change after this gdk update
is that now bin/background_jobs
will run Sidekiq Cluster, instead sidekiq. That's how the process tree will look like for Sidekiq Cluster running a single Sidekiq process:
› pstree 11170
-+= 11170 osw ruby bin/sidekiq-cluster * -P /Users/osw/projects/gitlab/gdk-ee/gitlab/tmp/pids/sidekiq-cluster.pid -e development
\--= 11178 osw /Users/osw/.rbenv/versions/2.6.5/bin/sidekiq -c50 -edevelopment -t25 -gqueues:auto_devops:auto_devops_disable,auto_merge:auto_merge_p ...
And that's the process relationship between them:
› ps j 11170
USER PID PPID PGID SESS JOBC STAT TT TIME COMMAND
osw 11170 6028 11170 0 0 Ss ?? 0:00.27 ruby bin/sidekiq-cluster * -P /Users/osw/projects/gitlab/gdk-ee/gitlab/tmp/pids/sidekiq-cluster.pid -e development
2.6.5 in gitlab/ on master
› ps j 11178
USER PID PPID PGID SESS JOBC STAT TT TIME COMMAND
osw 11178 11170 11178 0 1 R ?? 0:35.16 sidekiq 5.2.7 queues:auto_devops:auto_devops_disable
I wasn't able to consistently reproduce that. One successful try was:
- Changing GDK
Procfile
(removingSIDEKIQ_WORKERS=1
env var) with all processes running - Trying
gdk stop
wasn't able to stoprails-background-jobs
- Checking
ps aux | grep runsv
I sawfatal: unable to lock supervise/lock
- Force killing
runsv
fixed the issue
One detail raised by @jacobvosmaer-gitlab is that the processes don't have the same pgroup
, and runit
sends signals to the pgroups (e.g. runit control/t: sending TERM to -11170
- -
stands for signalling pgroups). That's not a real problem for the TERM
signal, as we trap and gracefully terminate the child processes in Sidekiq Cluster. But, if it timeouts, runit
will send a KILL
after a few seconds (7 by default).
The kill signal is not trapped and can probably kill just the parent (Sidekiq Cluster) and leave the children behind (http://morningcoffee.io/killing-a-process-and-all-of-its-descendants.html) - which I couldn't reproduce locally (sigkilling sidekiq-cluster
also kills the children).
Creating the issue given we also use runit
at Omnibus and we didn't get into the bottom of it for GDK.
I think one of the reasons we're not seeing problems in production is because of the extra supervision at https://gitlab.com/gitlab-org/gitlab/blob/4331091b0ba4bae0e74c5c015fe5e70f95055706/config/initializers/sidekiq_cluster.rb#L9-18. This monitors and kills children that lost the master process.