Sidekiq Cluster (child) processes can be left behind if the parent is SIGKILL'd (#270) · Issues · GitLab.com / GitLab Infrastructure Team / Observability / Observability Issue Tracker

Sidekiq Cluster (child) processes can be left behind if the parent is SIGKILL'd

After https://gitlab.com/gitlab-org/gitlab-development-kit/-/merge_requests/1128 was merged we've seen a few reports of people seeing a `timeout` when trying to start / restart GDK after doing a `gdk update`. The main change after this `gdk update` is that now [`bin/background_jobs`](https://gitlab.com/gitlab-org/gitlab/blob/4331091b0ba4bae0e74c5c015fe5e70f95055706/bin/background_jobs#L6) will run Sidekiq Cluster, instead sidekiq. That's how the process tree will look like for Sidekiq Cluster running a single Sidekiq process: ```bash › pstree 11170 -+= 11170 osw ruby bin/sidekiq-cluster * -P /Users/osw/projects/gitlab/gdk-ee/gitlab/tmp/pids/sidekiq-cluster.pid -e development \--= 11178 osw /Users/osw/.rbenv/versions/2.6.5/bin/sidekiq -c50 -edevelopment -t25 -gqueues:auto_devops:auto_devops_disable,auto_merge:auto_merge_p ... ``` And that's the process relationship between them: ```bash › ps j 11170 USER PID PPID PGID SESS JOBC STAT TT TIME COMMAND osw 11170 6028 11170 0 0 Ss ?? 0:00.27 ruby bin/sidekiq-cluster * -P /Users/osw/projects/gitlab/gdk-ee/gitlab/tmp/pids/sidekiq-cluster.pid -e development 2.6.5 in gitlab/ on master › ps j 11178 USER PID PPID PGID SESS JOBC STAT TT TIME COMMAND osw 11178 11170 11178 0 1 R ?? 0:35.16 sidekiq 5.2.7 queues:auto_devops:auto_devops_disable ``` I wasn't able to consistently reproduce that. One successful try was: 1. Changing GDK `Procfile` (removing `SIDEKIQ_WORKERS=1` env var) with all processes running 2. Trying `gdk stop` wasn't able to stop `rails-background-jobs` 3. Checking `ps aux | grep runsv` I saw `fatal: unable to lock supervise/lock` 4. Force killing `runsv` fixed the issue One detail raised by @jacobvosmaer-gitlab is that the [processes don't have the same `pgroup`](https://gitlab.com/gitlab-org/gitlab/blob/4331091b0ba4bae0e74c5c015fe5e70f95055706/lib/gitlab/sidekiq_cluster.rb#L103), and `runit` sends signals to the pgroups (e.g. `runit control/t: sending TERM to -11170` - `-` stands for signalling pgroups). That's not a real problem for the `TERM` signal, as we [trap and gracefully terminate the child processes in Sidekiq Cluster](https://gitlab.com/gitlab-org/gitlab/blob/4331091b0ba4bae0e74c5c015fe5e70f95055706/lib/gitlab/sidekiq_cluster/cli.rb#L129-137). But, if it timeouts, `runit` [will send a `KILL` after a few seconds (7 by default)](http://smarden.org/runit/sv.8.html). The kill signal is not trapped and can probably kill _just_ the parent (Sidekiq Cluster) and leave the children behind (http://morningcoffee.io/killing-a-process-and-all-of-its-descendants.html) - which I couldn't reproduce locally (sigkilling `sidekiq-cluster` also kills the children). Creating the issue given we also use `runit` at Omnibus and we didn't get into the bottom of it for GDK. I _think_ one of the reasons we're not seeing problems in production is because of the extra supervision at https://gitlab.com/gitlab-org/gitlab/blob/4331091b0ba4bae0e74c5c015fe5e70f95055706/config/initializers/sidekiq_cluster.rb#L9-18. This monitors and kills children that lost the master process.

issue