CNG: Main processes are not PID 1

Summary

By default, Kubernetes sends SIGTERM to the container when the pod is told to stop. The signal is delivered to PID 1 of the container. Sometimes the process that listens for SIGTERM, such as gitlab-workhorse or gitlab-shell, isn't PID 1. We've seen this cause problems: gitlab-workhorse graceful shutdown wasn't working.

This can happen for two reasons, sometimes both at once.

  1. We use the shell form of CMD (CMD command), which invokes a shell, so the command runs as a child process that doesn't get SIGTERM.

    If you use the shell form of the CMD, then the <command> will execute in /bin/sh -c.

  2. CMD points to a script that doesn't use exec, so the process is a child process of the script.
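
Both causes come down to whether the wrapper replaces itself with the service or forks it. The following is a minimal sketch of that difference, not taken from the CNG scripts: the outer `sh -c` stands in for the wrapper script, the inner `sh -c` stands in for the service, and the function names `without_exec` and `with_exec` are hypothetical.

```shell
#!/bin/sh
# Illustration only: the outer 'sh -c' plays the wrapper script,
# the inner 'sh -c' plays the service. Each prints its own PID ($$).

# Without exec: the wrapper forks the service, so the service gets a new PID.
# The trailing ':' keeps the service from being the wrapper's last command,
# which some shells would otherwise exec implicitly.
without_exec() {
  sh -c 'echo "wrapper=$$"; sh -c "echo service=\$\$"; :'
}

# With exec: the service replaces the wrapper process and keeps its PID.
# In a container, this is what lets the service end up as PID 1.
with_exec() {
  sh -c 'echo "wrapper=$$"; exec sh -c "echo service=\$\$"'
}

without_exec
with_exec
```

In the first call the two PIDs differ; in the second they are the same, which is the situation we want for the container's main process.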

We sometimes work around this by using pkill as a preStop hook, but that is easy to forget, and it is not the default behavior of our containers.

Current behavior

git@gitlab-gitlab-shell-77cf75d847-lnflg:/$ ps faux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
git        11056  0.2  0.0   5992  3636 pts/0    Ss   11:20   0:00 bash
git        11072  0.0  0.0   8592  3320 pts/0    R+   11:20   0:00  \_ ps faux
git            1  0.0  0.0   2420   524 ?        Ss   08:31   0:00 /bin/sh -c "/scripts/process-wrapper"
git           17  0.0  0.0   5868  3440 ?        S    08:31   0:00 /bin/bash /scripts/process-wrapper
git           21  0.0  0.0   4256   564 ?        S    08:31   0:00  \_ tail -f /var/log/gitlab-shell/gitlab-shell.log /var/log/gitlab-shell/ssh.log
git           22  0.0  0.0  13292  7620 ?        S    08:31   0:01  \_ sshd: /usr/sbin/sshd -D -E /var/log/gitlab-shell/ssh.log [listener] 0 of 10-100 startups

git@gitlab-webservice-default-5d85b6854c-sbx2z:/$ ps faux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1015  0.0  0.0 805036  4588 ?        Rsl  13:12   0:00 runc init
git         1005  0.3  0.0   5992  3784 pts/0    Ss   13:12   0:00 bash
git         1014  0.0  0.0   8592  3364 pts/0    R+   13:12   0:00  \_ ps faux
git            1  0.0  0.0   2420   532 ?        Ss   12:52   0:00 /bin/sh -c /scripts/start-workhorse
git           16  0.0  0.0   5728  3408 ?        S    12:52   0:00 /bin/bash /scripts/start-workhorse
git           19  0.0  0.3 1328480 33080 ?       Sl   12:52   0:00  \_ gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml

Expected behavior

git@gitlab-webservice-default-84c68fc9c9-dzfd4:/$ ps faux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
git          103  0.5  0.0   5992  3812 pts/0    Ss   07:33   0:00 bash
git          111  0.0  0.0   8592  3172 pts/0    R+   07:33   0:00  \_ ps faux
git            1  0.1  0.3 1254496 32120 ?       Ssl  07:32   0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml

Affected containers

Definition of Done

  • Update the affected containers so that the main process is PID 1 and receives the signal directly
  • Add a linting rule, such as a hadolint rule, to prevent the shell form (CMD command) and always prefer the exec form (CMD ["command"]). Done in a follow-up issue 👉 #3253
  • Optional: Remove the pkill in the charts that works around this. 👉 #3249 (comment 935872036)

Actionable items

  • Containers to use the CMD [] (exec form) syntax
  • Process startup scripts to use exec, or to trap signals and pass them to service children (not every service can run as the only process)
  • Charts to be evaluated for the need to update or remove any pre* handlers, and/or liveness/readiness probe changes 👉 #3249 (comment 935872036)
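
For startup scripts that cannot simply exec a single service, the "trap & pass signals" item could look roughly like the sketch below. It is an assumption about the shape of such a wrapper, not the actual CNG script: `run_with_forwarding` is a hypothetical helper, and it handles only one child and only SIGTERM.

```shell
#!/bin/sh
# Hypothetical helper: run a command as a background child, forward SIGTERM
# to it, and return the child's exit status.
run_with_forwarding() {
  term_received=0
  "$@" &
  child=$!
  # On SIGTERM, record it and pass the signal on to the child.
  trap 'term_received=1; kill -TERM "$child" 2>/dev/null' TERM
  # wait returns early when a trapped signal arrives...
  wait "$child"
  status=$?
  # ...so wait again to collect the child's real exit status.
  if [ "$term_received" -eq 1 ]; then
    wait "$child"
    status=$?
  fi
  trap - TERM
  return "$status"
}
```

A wrapper might then end with something like `run_with_forwarding /usr/sbin/sshd -D`, so that a SIGTERM delivered to the script reaches the service. A script that supervises several children would need to track and signal each of them.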
Edited by Jason Plum