Skip to content

Implement graceful build container shutdown for docker executor

What does this MR do?

This MR implements graceful shutdown of build containers in docker executor. It does this in two ways:

  1. Uses Init:true in container.HostConfig, which runs tini-init as PID 1. This is equivalent to including --init in docker run or docker create and is available in docker>=1.13 (see https://github.com/krallin/tini). This will propagate the SIGTERM sent by docker on docker stop to PID 1 to children of PID 1.
  2. docker execs into the build container and sends SIGTERM to all process excluding 1 and itself, in decreasing numerical pid order. This part is only implemented for bash/sh shell.

Even with 1 inplace, 2 is necessary to handle the case when shells are in the mix (which is always), since shells do no propagate signals to children. To properly send SIGTERM (or any signal) to a process spawned by a shell, we have to send the signal directly to that process. We send signals in decreasing pid order positing that the leaf (and thus higher numbered) processes are the ones that are long running and preventing the container from exiting, and terminating them will allow the rest of the process tree to terminate naturally. Note that any shell processes that are sent signals will ignore them anyway, so this approach is less heavy-handed than it might appear at first.

Note: best reviewed commit at a time.

Note: I'm working on adding an integration test.

Why was this MR needed?

Enable graceful shutdown of build containers when a job has been cancelled or times out, to properly enable cleanup of resources created by the job.

What's the best way to test this MR?

Run the following job on a runner with docker executor:

.gitlab-ci.yml

stages:
  - test

test:
  timeout: 15s
  stage: test
  image: ubuntu:latest # this can be any container
  script:
    - ./long-script-with-cleanup.sh

long-script-with-cleanup.sh

#!/bin/sh

rm -f output.txt

cleanup() {
	echo "caught signal $1" | tee -a output.txt
	sleep 3
	echo "exiting..." | tee -a output.txt
	exit 0
}

trap 'cleanup "SIGTERM"' TERM

for i in $(seq 1 60); do
	echo "$i" | tee -a output.txt
	sleep 1
done
  1. let the job run until it times out
  2. cancel the job from the UI before the timeout

Then inspect the docker volumes for the output file created by the job script.

> sudo cat $(sudo find docker/volumes -name output.txt)
1
2
3
4
5
6
7
8
9
10
11
12
13
caught signal SIGTERM
exiting...

In both cases the lines :

caught signal SIGTERM
exiting...

indicate the process received SIGTERM AND was allowed to exit before being SIGKILLed

What are the relevant issue numbers?

Closes #6359 (closed)

Edited by Axel von Bertoldi

Merge request reports