CNG: Main processes are not PID 1
Summary
By default Kubernetes sends a SIGTERM to the container when the pod is told to stop. The signal is sent to PID 1 of the container. Sometimes the process that listens for SIGTERM isn't PID 1, as with gitlab-workhorse or gitlab-shell. We've seen this cause problems because the gitlab-workhorse graceful shutdown wasn't working.
This can happen for two reasons, sometimes both together:
- We use the shell form, CMD command, which invokes a shell, so the command is a child process of that shell and doesn't get SIGTERM. As the Dockerfile reference puts it: "If you use the shell form of the CMD, then the <command> will execute in /bin/sh -c".
- CMD points to a script that doesn't use exec, so the process is a child process of the script.
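The difference exec makes can be seen outside of any container. The following is a minimal sketch, not taken from the CNG scripts: it prints the PID of a wrapper shell and then the PID of the command the wrapper starts, once without exec and once with it. The variable names are illustrative only.

```shell
#!/bin/sh
# Without exec: the wrapper shell stays alive and runs the command as a child,
# so the wrapper and the command report different PIDs.
# (The trailing "true" keeps the shell from optimizing the call into an exec.)
no_exec=$(sh -c 'echo $$; sh -c "echo \$\$"; true')

# With exec: the command replaces the wrapper shell and keeps its PID,
# which is what lets it become PID 1 in a container.
with_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

# Count the distinct PIDs seen in each case.
no_exec_pids=$(printf '%s\n' "$no_exec" | sort -u | wc -l | tr -d ' ')
with_exec_pids=$(printf '%s\n' "$with_exec" | sort -u | wc -l | tr -d ' ')

echo "distinct PIDs without exec: $no_exec_pids"
echo "distinct PIDs with exec: $with_exec_pids"
```

Without exec, two processes exist and only the outer one would receive a signal sent to the wrapper's PID. With exec there is a single process.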
We sometimes work around this by using pkill as a preStop hook, but that is easy to forget, and it is not the default behavior of our containers.
Current behavior
git@gitlab-gitlab-shell-77cf75d847-lnflg:/$ ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
git 11056 0.2 0.0 5992 3636 pts/0 Ss 11:20 0:00 bash
git 11072 0.0 0.0 8592 3320 pts/0 R+ 11:20 0:00 \_ ps faux
git 1 0.0 0.0 2420 524 ? Ss 08:31 0:00 /bin/sh -c "/scripts/process-wrapper"
git 17 0.0 0.0 5868 3440 ? S 08:31 0:00 /bin/bash /scripts/process-wrapper
git 21 0.0 0.0 4256 564 ? S 08:31 0:00 \_ tail -f /var/log/gitlab-shell/gitlab-shell.log /var/log/gitlab-shell/ssh.log
git 22 0.0 0.0 13292 7620 ? S 08:31 0:01 \_ sshd: /usr/sbin/sshd -D -E /var/log/gitlab-shell/ssh.log [listener] 0 of 10-100 startups
git@gitlab-webservice-default-5d85b6854c-sbx2z:/$ ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1015 0.0 0.0 805036 4588 ? Rsl 13:12 0:00 runc init
git 1005 0.3 0.0 5992 3784 pts/0 Ss 13:12 0:00 bash
git 1014 0.0 0.0 8592 3364 pts/0 R+ 13:12 0:00 \_ ps faux
git 1 0.0 0.0 2420 532 ? Ss 12:52 0:00 /bin/sh -c /scripts/start-workhorse
git 16 0.0 0.0 5728 3408 ? S 12:52 0:00 /bin/bash /scripts/start-workhorse
git 19 0.0 0.3 1328480 33080 ? Sl 12:52 0:00 \_ gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
Expected behavior
git@gitlab-webservice-default-84c68fc9c9-dzfd4:/$ ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
git 103 0.5 0.0 5992 3812 pts/0 Ss 07:33 0:00 bash
git 111 0.0 0.0 8592 3172 pts/0 R+ 07:33 0:00 \_ ps faux
git 1 0.1 0.3 1254496 32120 ? Ssl 07:32 0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
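A quick way to tell the two states apart is to look at the command line of PID 1. The sketch below assumes a Linux container with a readable /proc. The helper names is_shell_wrapper and pid_cmdline are hypothetical and do not exist in the CNG images.

```shell
#!/bin/sh
# is_shell_wrapper succeeds when a command line starts with "/bin/sh -c",
# which is how the shell form of CMD shows up as PID 1 in the listings above.
is_shell_wrapper() {
  case "$1" in
    "/bin/sh -c "*) return 0 ;;
    *) return 1 ;;
  esac
}

# pid_cmdline prints the command line of a PID, read from /proc (Linux only).
# Arguments are NUL-separated in /proc, so they are converted to spaces.
pid_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
}

# Inspect the real PID 1 when /proc is readable.
if [ -r /proc/1/cmdline ]; then
  if is_shell_wrapper "$(pid_cmdline 1)"; then
    echo "PID 1 is a /bin/sh -c wrapper"
  else
    echo "PID 1 is not a /bin/sh -c wrapper"
  fi
fi
```

This only detects the /bin/sh -c case. A script that runs as PID 1 without exec (the second cause in the summary) would need a check against the expected service name instead.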
Affected containers
- gitlab-workhorse 👉 gitlab-org/build/CNG!972 (merged)
- gitlab-shell 👉 gitlab-org/build/CNG!977 (merged)
- gitaly 👉 gitlab-org/build/CNG!978 (merged)
- container-registry 👉 gitlab-org/build/CNG!980 (merged)
- gitlab-pages 👉 gitlab-org/build/CNG!979 (merged)
- webservice 👉 gitlab-org/build/CNG!1010 (merged)
- geo-logcursor 👉 gitlab-org/build/CNG!1009 (merged)
- sidekiq 👉 gitlab-org/build/CNG!1006 (merged)
- mailroom 👉 gitlab-org/build/CNG!992 (merged)
- gitlab-exporter 👉 gitlab-org/build/CNG!990 (merged)
Definition of Done
- Update the affected containers so that the main process is PID 1 and the signal is delivered correctly
- Add a linting rule, such as a hadolint rule, to prevent CMD command and always prefer CMD ["command"]. Done in a follow-up issue 👉 #3253
- Optional: Remove the pkill workaround that we have in the charts. 👉 #3249 (comment 935872036)
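To illustrate what such a lint check looks for, here is a rough grep-based sketch. It is an assumption for illustration only, not how #3253 was implemented and not a replacement for hadolint. The function name find_shell_form is hypothetical, and the pattern ignores line continuations and lowercase instructions.

```shell
#!/bin/sh
# find_shell_form prints (with line numbers) every CMD or ENTRYPOINT line
# whose arguments do not start with "[", i.e. lines using the shell form.
find_shell_form() {
  grep -nE '^[[:space:]]*(CMD|ENTRYPOINT)[[:space:]]+[^[[:space:]]' "$1"
}

# Demo against two throwaway Dockerfiles.
tmp=$(mktemp)

printf 'FROM scratch\nCMD /scripts/start-workhorse\n' > "$tmp"
shell_hits=$(find_shell_form "$tmp")

printf 'FROM scratch\nCMD ["/scripts/start-workhorse"]\n' > "$tmp"
exec_hits=$(find_shell_form "$tmp")

rm -f "$tmp"

echo "shell form matches: ${shell_hits:-none}"
echo "exec form matches: ${exec_hits:-none}"
```

In CI, a non-empty result from find_shell_form could fail the job, so that a shell-form CMD cannot be reintroduced unnoticed.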
Actionable items
- The containers to use CMD [] syntax
- Process startup scripts use exec, or trap & pass signals to service children (not all processes can run single)
- Charts to be evaluated for the need to update or remove any pre* handlers, and/or liveness/readiness probe changes 👉 #3249 (comment 935872036)
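For startup scripts that cannot simply exec a single process, the "trap & pass signals" option could look roughly like the sketch below. It is a hedged illustration, not the actual CNG process-wrapper: run_service is a hypothetical helper, and the demo uses a stand-in service instead of a real GitLab component.

```shell
#!/bin/sh
# run_service starts "$@" as a background child, forwards SIGTERM to it,
# and returns the child's exit status.
run_service() {
  "$@" &
  child=$!
  # Forward the termination signal instead of letting it stop at the wrapper.
  trap 'kill -TERM "$child" 2>/dev/null' TERM
  wait "$child"
  status=$?
  # wait returns early (status > 128) when a trapped signal arrives;
  # wait again to collect the child's real exit status.
  if [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; then
    wait "$child"
    status=$?
  fi
  trap - TERM
  return "$status"
}

# Demo: a stand-in service that exits 0 once it receives SIGTERM.
( run_service sh -c 'trap "exit 0" TERM; while :; do sleep 1; done' ) &
wrapper=$!
sleep 1                 # give the wrapper and the service time to install traps
kill -TERM "$wrapper"   # plays the role of the SIGTERM Kubernetes sends to PID 1
wait "$wrapper"
demo_status=$?
echo "wrapper exited with status $demo_status"
```

The wrapper stays PID 1 in this design, so it must both forward the signal and wait for the child. If it exited right away, the container would stop before the service finished its graceful shutdown.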
Edited by Jason Plum