Use a lightweight init daemon to prevent zombie processes
In gitlab-org/charts/gitlab#3249 (closed), we moved all processes to PID 1 to ensure that they receive SIGTERM when the container shuts down.
However, PID 1 has a special responsibility as described in https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/:
Its task is to "adopt" orphaned child processes (again, this is the actual technical term). This means that the init process becomes the parent of such processes, even though those processes were never created directly by the init process.
This was brought to our attention by Puma v6.4.1, which was deployed for a few days and includes https://github.com/puma/puma/pull/3255. That change attempts to reap child processes when Puma is PID 1. Unfortunately, for some reason on my GKE nodes, Process.wait2(-1) doesn't appear to return reaped child processes on PID 1 (https://github.com/puma/puma/issues/3313), which might have caused this incident on websockets nodes: gitlab-com/gl-infra/production#17372 (closed). We probably could have avoided this incident if Puma were not PID 1.
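For readers unfamiliar with reaping, here is a minimal Python sketch of the general technique: a non-blocking loop that collects every child that has already exited. It is not Puma's code. `os.waitpid(-1, os.WNOHANG)` is the Python counterpart of a `Process.wait2(-1, Process::WNOHANG)` call in Ruby; the use of `WNOHANG` and the name `reap_children` are assumptions made for illustration.

```python
import os


def _exit_code(status):
    """Decode a raw wait status: exit code, or negative signal number."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def reap_children():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) tuples for the children reaped.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns (0, 0) instead of blocking.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # This process has no children at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.append((pid, _exit_code(status)))
    return reaped
```

When the calling process is PID 1, "any child" also covers orphans that the kernel has re-parented to it, which is why a loop like this can report children the application never started.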
When Puma v6.4.1 was deployed on GitLab.com for a few days, we saw close to a million log messages relating to reaping unknown child processes (https://log.gprd.gitlab.net/app/r/s/arg1c), mostly on web nodes.
As explained in the links above, this can happen if a subprocess doesn't clean up its child processes properly.
We currently don't have visibility into how many zombie processes there are, but the data above suggests this is a common occurrence. I'd expect to see something similar with Sidekiq, which doesn't have built-in reaping.
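One possible way to get that visibility, sketched under assumptions: on Linux, the third field of `/proc/<pid>/stat` is the process state, and `Z` marks a zombie. The function names and the `proc_root` parameter below are hypothetical, and this is not an existing GitLab metric or exporter.

```python
import os


def parse_state(stat_line):
    """Return the single-letter process state from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the LAST ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes under proc_root whose state is 'Z' (zombie)."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                if parse_state(f.read()) == "Z":
                    zombies += 1
        except OSError:
            # The process vanished or is not readable; skip it.
            continue
    return zombies
```

Run inside a container, `count_zombies()` only sees the processes in that container's PID namespace, so it would have to be sampled per container to be useful.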
Proposal
We should consider using something like tini for all the processes.
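To make the idea concrete, here is a rough Python sketch of the job such an init daemon performs: start the real process as a child, forward termination signals to it, reap every child (including adopted orphans), and exit with the main child's status. This is for illustration only and is not tini's implementation; `run_as_init` and `exit_code_from` are hypothetical names, and forwarding only SIGTERM and SIGINT is a simplification.

```python
import os
import signal


def exit_code_from(status):
    """Decode a raw wait status using the shell's 128+signal convention."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)


def run_as_init(argv):
    """Run argv as a child, forward signals to it, and reap all children.

    Returns the exit code of the main child.
    """
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass  # the child has already exited

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # Blocks until any child exits; orphans adopted by this
            # process are reaped here too and otherwise ignored.
            pid, status = os.waitpid(-1, 0)
            if pid == child:
                return exit_code_from(status)
    finally:
        # Restore the handlers that were installed before this call.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

With a wrapper like this as PID 1, the application (Puma, Sidekiq, and so on) still receives SIGTERM on shutdown, but it no longer carries the reaping responsibility itself.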
@WarheadsSE @sxuereb Did we ever consider this?