
Job execution pods in Kubernetes don't handle signals

Status update (2023-03-17)

We analyzed options to resolve this bug in this spike, but did not land on a definitive solution. The investigation has so far looked at the following:

  1. Changing the detect shell script so that the signal is propagated to the job process
  2. Even after those changes, PID 1 still won't forward the received signal to the PID running our job. Manually shelling into the build container and sending the TERM and KILL signals does gracefully stop everything
  3. The only solution not yet tested is to use tini in the runner image (see the Dockerfile sketch below).
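
For reference, a minimal sketch of what using tini as PID 1 could look like in the build image; the base image and package manager here are assumptions, not the actual runner images:

# Dockerfile (hypothetical Alpine-based build image)
FROM alpine:3.17
RUN apk add --no-cache tini
# tini runs as PID 1, reaps children, and forwards TERM/KILL to the command it wraps
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["sh"]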

Summary

Normally, when starting containers in Kubernetes, a process runs in each container that handles the TERM/KILL signals sent by Kubernetes. Kubernetes sends signals to PID 1 in each container of a pod to let it know it's time to shut down: it first sends a TERM signal and then, after terminationGracePeriodSeconds, a KILL signal.
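
For context, terminationGracePeriodSeconds is a standard field on the pod spec. A minimal sketch of a pod that goes through this TERM-then-KILL sequence (names and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: signal-demo
spec:
  # TERM is sent to PID 1 when the pod is deleted; KILL follows 30 seconds later
  terminationGracePeriodSeconds: 30
  containers:
    - name: build
      image: alpine:3.17
      # a shell as PID 1 running sleep, mirroring the job pod behaviour described below
      command: ["sh", "-c", "sleep 120"]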

For build containers, it works differently. PID 1 in the build container of a job execution pod is just a shell, which doesn't appear to handle the TERM signal, so the pod lives for the full terminationGracePeriodSeconds after Kubernetes starts the pod termination process. This makes the Kubernetes lifecycle hook support somewhat useless: even once the hook is done, the pod can't be shut down. If PID 1 were something that handles signals, the job execution pod could be cleaned up as soon as the lifecycle hook has executed.
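
To illustrate what "something that handles signals" means here, a minimal shell sketch (not the runner's actual entrypoint) of a PID 1 wrapper that traps TERM and forwards it to the job process:

#!/bin/sh
# Hypothetical PID 1 wrapper: run the job in the background and forward TERM to it,
# so the job can shut down gracefully instead of waiting out terminationGracePeriodSeconds.
"$@" &
child=$!
trap 'kill -TERM "$child" 2>/dev/null' TERM
# wait is interrupted when the trapped TERM arrives; wait again so the child can finish exiting.
wait "$child"
wait "$child"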

Steps to reproduce

  1. Set your job pod's terminationGracePeriodSeconds to 30 in your config.toml (see the config.toml fragment below)
  2. Start a pipeline whose job runs a sleep 120
  3. While the sleep is running, delete the job pod.
  4. Notice that the process running the sleep is not terminated, as it never receives the TERM signal
  5. Notice that the job pod keeps running as if nothing happened for 30s and is then killed by the container runtime.
.gitlab-ci.yml

job:
  script:
    - sleep 120
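
For the reproduction, the relevant config.toml fragment would look roughly like this (a sketch; the exact key name and placement depend on the runner version in use):

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # grace period applied to the job pod, assumed here to match the option named in step 1
    terminationGracePeriodSeconds = 30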

Actual behavior

Job execution pods always live for the full terminationGracePeriodSeconds because they don't handle the TERM signal from Kubernetes, which means you can't finish the job's work and exit the pod cleanly.

Expected behavior

PID 1 should handle signals and exit cleanly when a job is done, without requiring the GitLab Runner to do the cleanup.

Environment description

GitLab version: 13.8.8-ee
Executor: Kubernetes

Used GitLab Runner version

$ gitlab-runner --version
Version:      14.2.0
Git revision: 58ba2b95
Git branch:   14-2-stable
GO version:   go1.13.8
Built:        2021-08-22T19:47:58+0000
OS/Arch:      linux/amd64