fix: gitlab-workhorse graceful termination
What does this MR do?
What
- Move the gitlab-workhorse process to PID 1 by using exec and the array form of CMD.
- Add a feature flag (GITLAB_WORKHORSE_EXEC) that runs gitlab-workhorse as PID 1. It is off by default and will be removed after a successful rollout on GitLab.com.
Why
gitlab-workhorse supports graceful termination, but we are not using it. As a result, the gitlab-workhorse pod keeps running for 30s on shutdown, responding with 502, until it receives SIGKILL.
By default, Kubernetes sends SIGTERM to PID 1 in the container, and workhorse listens for this signal. However, workhorse is not PID 1, as the process tree below shows, for two reasons:
- CMD isn't passed as an array. The Dockerfile reference (https://docs.docker.com/engine/reference/builder/#cmd) specifies: CMD command param1 param2 (shell form), so this sets sh as PID 1.
- A shell script is invoked, which creates gitlab-workhorse as a child process.
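A minimal sketch of the second fix, assuming the start script gates exec on the GITLAB_WORKHORSE_EXEC flag. The function name start_process is hypothetical and this is not the actual content of /scripts/start-workhorse. With exec, the shell is replaced by the command, so the command keeps the shell's PID. Combined with the array form of CMD (for example CMD ["/scripts/start-workhorse"]), that PID would be 1.

```shell
# start_process: run "$@" either in place of the current shell (flag set)
# or as a child process of it (flag unset, the previous behaviour).
# Hypothetical helper for illustration only.
start_process() {
  if [ -n "${GITLAB_WORKHORSE_EXEC:-}" ]; then
    # exec replaces the shell, so the command inherits its PID and
    # receives SIGTERM directly.
    exec "$@"
  else
    # The shell stays alive as the parent; the command gets a new PID.
    "$@"
  fi
}

# Possible use at the end of a start script (arguments elided here):
#   start_process gitlab-workhorse "$@"
```

Because exec never returns on success, nothing placed after the call runs when the flag is set, which is why the call belongs at the very end of the script.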
Process tree before:
git@gitlab-webservice-default-5d85b6854c-sbx2z:/$ ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1015 0.0 0.0 805036 4588 ? Rsl 13:12 0:00 runc init
git 1005 0.3 0.0 5992 3784 pts/0 Ss 13:12 0:00 bash
git 1014 0.0 0.0 8592 3364 pts/0 R+ 13:12 0:00 \_ ps faux
git 1 0.0 0.0 2420 532 ? Ss 12:52 0:00 /bin/sh -c /scripts/start-workhorse
git 16 0.0 0.0 5728 3408 ? S 12:52 0:00 /bin/bash /scripts/start-workhorse
git 19 0.0 0.3 1328480 33080 ? Sl 12:52 0:00 \_ gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
Process tree after (with GITLAB_WORKHORSE_EXEC set):
git@gitlab-webservice-default-84c68fc9c9-dzfd4:/$ ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
git 103 0.5 0.0 5992 3812 pts/0 Ss 07:33 0:00 bash
git 111 0.0 0.0 8592 3172 pts/0 R+ 07:33 0:00 \_ ps faux
git 1 0.1 0.3 1254496 32120 ? Ssl 07:32 0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
This change is behind a feature flag so that we can roll it out slowly and see whether it causes any problems.
Testing
GITLAB_WORKHORSE_EXEC not defined
- values.yaml
---
# values-minikube.yaml
# This example intended as baseline to use Minikube for the deployment of GitLab
# - Minimized CPU/Memory load, can fit into 3 CPU, 6 GB of RAM (barely)
# - Services that are not compatible with how Minikube runs are disabled
# - Some services entirely removed, or scaled down to 1 replica.
# - Configured to use 192.168.99.100, and nip.io for the domain
# Minimal settings
global:
  ingress:
    configureCertmanager: false
    class: nginx
  hosts:
    domain: 192.168.99.100.nip.io
    externalIP: 192.168.99.100
  # Disable Rails bootsnap cache (temporary)
  rails:
    bootsnap:
      enabled: false
  shell:
    # Configure the clone link in the UI to include the high-numbered NodePort
    # value from below (`gitlab.gitlab-shell.service.nodePort`)
    port: 32022
# Don't use certmanager, we'll self-sign
certmanager:
  install: false
# Use the `ingress` addon, not our Ingress (can't map 22/80/443)
nginx-ingress:
  enabled: false
# Save resources, only 3 CPU
prometheus:
  install: false
gitlab-runner:
  install: false
# Reduce replica counts, reducing CPU & memory requirements
gitlab:
  webservice:
    minReplicas: 1
    maxReplicas: 1
  sidekiq:
    minReplicas: 1
    maxReplicas: 1
  gitlab-shell:
    minReplicas: 1
    maxReplicas: 1
    # Map gitlab-shell to a high-numbered NodePort to support cloning over SSH since
    # Minikube takes port 22.
    service:
      type: NodePort
      nodePort: 32022
registry:
  hpa:
    minReplicas: 1
    maxReplicas: 1
- Deploy the Helm chart
helm upgrade gitlab . --timeout 600s -f values.yaml --set global.hosts.domain=$(minikube ip).nip.io --set global.hosts.externalIP=$(minikube ip) --set gitlab.webservice.workhorse.image=registry.gitlab.com/gitlab-org/build/cng/gitlab-workhorse-ee --set gitlab.webservice.workhorse.tag=fix-workhorse-sigterm
- Check in the process tree that gitlab-workhorse is not PID 1.
$ kubectl exec $(kubectl get po -l app=webservice -o jsonpath='{..metadata.name}') -c gitlab-workhorse -- ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
git 101 0.0 0.0 8592 3172 ? Rs 09:35 0:00 ps faux
git 1 0.0 0.0 5728 3464 ? Ss 09:34 0:00 /bin/bash /scripts/start-workhorse
git 19 0.0 0.3 1327968 32256 ? Sl 09:34 0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
GITLAB_WORKHORSE_EXEC defined
- values.yaml
---
# values-minikube.yaml
# This example intended as baseline to use Minikube for the deployment of GitLab
# - Minimized CPU/Memory load, can fit into 3 CPU, 6 GB of RAM (barely)
# - Services that are not compatible with how Minikube runs are disabled
# - Some services entirely removed, or scaled down to 1 replica.
# - Configured to use 192.168.99.100, and nip.io for the domain
# Minimal settings
global:
  ingress:
    configureCertmanager: false
    class: nginx
  hosts:
    domain: 192.168.99.100.nip.io
    externalIP: 192.168.99.100
  # Disable Rails bootsnap cache (temporary)
  rails:
    bootsnap:
      enabled: false
  shell:
    # Configure the clone link in the UI to include the high-numbered NodePort
    # value from below (`gitlab.gitlab-shell.service.nodePort`)
    port: 32022
# Don't use certmanager, we'll self-sign
certmanager:
  install: false
# Use the `ingress` addon, not our Ingress (can't map 22/80/443)
nginx-ingress:
  enabled: false
# Save resources, only 3 CPU
prometheus:
  install: false
gitlab-runner:
  install: false
# Reduce replica counts, reducing CPU & memory requirements
gitlab:
  webservice:
    minReplicas: 1
    maxReplicas: 1
    extraEnv:
      GITLAB_WORKHORSE_EXEC: 1
  sidekiq:
    minReplicas: 1
    maxReplicas: 1
  gitlab-shell:
    minReplicas: 1
    maxReplicas: 1
    # Map gitlab-shell to a high-numbered NodePort to support cloning over SSH since
    # Minikube takes port 22.
    service:
      type: NodePort
      nodePort: 32022
registry:
  hpa:
    minReplicas: 1
    maxReplicas: 1
- Deploy the Helm chart
helm upgrade gitlab . --timeout 600s -f values.yaml --set global.hosts.domain=$(minikube ip).nip.io --set global.hosts.externalIP=$(minikube ip) --set gitlab.webservice.workhorse.image=registry.gitlab.com/gitlab-org/build/cng/gitlab-workhorse-ee --set gitlab.webservice.workhorse.tag=fix-workhorse-sigterm
- Check in the process tree that gitlab-workhorse is PID 1.
$ kubectl exec $(kubectl get po -l app=webservice -o jsonpath='{..metadata.name}') -c gitlab-workhorse -- ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
git 613 0.0 0.0 8592 3212 ? Rs 09:32 0:00 ps faux
git 597 0.0 0.0 5992 3812 pts/0 Ss+ 09:32 0:00 bash
git 1 0.0 0.3 1328224 33028 ? Ssl 09:20 0:00 gitlab-workhorse -logFile stdout -logFormat json -listenAddr 0.0.0.0:8181 -documentRoot /srv/gitlab/public -secretPath /etc/gitlab/gitlab-workhorse/secret -config /srv/gitlab/config/workhorse-config.toml
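The manual checks in both scenarios could be scripted. The sketch below assumes the ps faux column layout shown above, with START as a single field so that COMMAND begins at column 11. The helper name pid1_command is hypothetical.

```shell
# pid1_command: read `ps faux`-style output on stdin and print the first
# word of the COMMAND column for the process with PID 1.
# Assumes COMMAND starts at field 11, as in the listings in this MR.
pid1_command() {
  awk '$2 == 1 { print $11; exit }'
}

# Possible use against a running pod (same kubectl command as in the steps):
#   kubectl exec $(kubectl get po -l app=webservice -o jsonpath='{..metadata.name}') \
#     -c gitlab-workhorse -- ps faux | pid1_command
```

With the flag unset this should print the shell (/bin/bash in the listing above), and with GITLAB_WORKHORSE_EXEC set it should print gitlab-workhorse.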
Related issues
https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497#note_910106152
Checklist
See Definition of done.
For anything in this list which will not be completed, please provide a reason in the MR discussion
Required
- Merge Request Title, and Description are up to date, accurate, and descriptive
- MR targeting the appropriate branch
- MR has a green pipeline on GitLab.com
Expected (please provide an explanation if not completing)
- Test plan indicating conditions for success has been posted and passes
- Documentation created/updated
- Integration tests added to GitLab QA
- The impact any change in container size has should be evaluated