Gitlab-Runner with docker executor does not stop the container if job was canceled.

Summary

Gitlab-Runner with docker executor does not stop the container if job was canceled. Instead, it attempts to execute sh script to send SIGTERM to processes and then immediately force removes the container when the script times out.

Steps to reproduce

Here is a simple script to reproduce the issue. It is an entrypoint for a Dockerfile that will spawn 10 processes along with the command. This is to make problem more visible - with 10s timeout graceGracefulExitFunc will be stuck trying to kill processes and will not deliver SIGTERM to all of them within timeout.

Dockerfile
FROM node:22.13-alpine3.21

WORKDIR /etc/app

COPY entrypoint.js /usr/local/bin/entrypoint.js

ENTRYPOINT ["node", "/usr/local/bin/entrypoint.js"]
entrypoint.js
const isParent = !process.env.IS_CHILD;

function log(message) {
	process.stdout.write(`[${process.pid}][${isParent ? 'parent' : 'child'}] ${message}\n`);
}
function logError(message) {
	process.stderr.write(`[${process.pid}][${isParent ? 'parent' : 'child'}] ${message}\n`);
}

function waitForChild(child) {
	return new Promise((resolve) => {
		child.on('exit', (code, signal) => {
			if (signal) {
				log(`Child process[${child.pid}] was terminated by signal: ${signal}`);
			} else {
				log(`Child process[${child.pid}] exited with code: ${code}`);
			}
			resolve();
		});

		child.on('error', (err) => {
			logError(`Child process[${child.pid}] errored: ${err}`);
			resolve();
		});
	});
};

if (isParent) {
	const childProcesses = [];
	for (let i = 0; i < 10; i++) {
		const child = child_process.fork(__filename, {env: {IS_CHILD: 'true'}, stdio: ['ignore', 'inherit', 'inherit', 'ipc']});
		childProcesses.push(child);
	}

    log('Forked 10 children');
	Promise.all(childProcesses.map(waitForChild)).then(() => log('All children exited'));

	const rawArgs = process.argv.slice(2);
	const command = rawArgs[0];
	const args = rawArgs.slice(1);
	if (command) {
		log(`Starting command: ${command} ${args.join(' ')}`);
		const commandProcess = child_process.spawn(command, args, {
			stdio: 'inherit',
			env: process.env
		});
		waitForChild(commandProcess).then(() => {
			log(`Command ${command} exited. Killing dummy processes in 30s`);
			setTimeout(() => {
				childProcesses.map((child) => child.kill('SIGKILL'));
			}, 30000);
		});
		process.on('SIGTERM', () => {
			childProcesses.map((child) => child.kill('SIGTERM'));
		});
	} else {
		log('No command provided');
	}
} else {
	log('I am a child');
	const startedAt = Date.now();
	setInterval(() => log(`Still alive for ${Date.now() - startedAt}ms`), 30000);
}

process.on('SIGTERM', () => log('Received SIGTERM'));
process.on('SIGINT', () => log('Received SIGINT'));
.gitlab-ci.yml
    - test
test:
    image: gitlab-runner-test:v0.0.1 #image built from Dockerfile and entrypoint.js above
    stage: test
    tags:
      - docker, ansible, dev
    allow_failure: true
    script:
        - >
        - echo 'test'
        - sleep 360
        - echo 'done'

Actual behavior

When job is cancelled, gitlab runner executes graceGracefulExitFunc sending SIGTERM to processes in container. However, it has timeout of 10 seconds and does not always terminate all the processes within that timeout.
Then container is force removed then, without any attempt to execute stop command on it.
(See attached gitlab-runner and container logs.)

Therefore entrypoint with pid 1 does not receive SIGTERM and has no chance to do proper cleanup. As well as any parent process if it spawns enough child processes.

Expected behavior

After attempting to execute graceGracefulExitFunc to kill child processes and timing out, gitlab-runner always attempts to execute stop on container to give a chance of proper cleanup.

Relevant logs and/or screenshots

job log from container
[13][child] I am a child
[12][child] I am a child
[20][child] I am a child
[37][child] I am a child
[14][child] I am a child
[39][child] I am a child
[26][child] I am a child
[28][child] I am a child
[30][child] I am a child
[49][child] I am a child
[1][parent] Forked 10 children
[1][parent] Starting command: sh -c if [ -x /usr/local/bin/bash ]; then
	exec /usr/local/bin/bash 
elif [ -x /usr/bin/bash ]; then
	exec /usr/bin/bash 
elif [ -x /bin/bash ]; then
	exec /bin/bash 
elif [ -x /usr/local/bin/sh ]; then
	exec /usr/local/bin/sh 
elif [ -x /usr/bin/sh ]; then
	exec /usr/bin/sh 
elif [ -x /bin/sh ]; then
	exec /bin/sh 
elif [ -x /busybox/sh ]; then
	exec /busybox/sh 
else
	echo shell not found
	exit 1
fi


$ echo 'test'
test
$ sleep 360
[13][child] Still alive for 30030ms
[12][child] Still alive for 30016ms
[20][child] Still alive for 30030ms
[37][child] Still alive for 30029ms
[39][child] Still alive for 30003ms
[14][child] Still alive for 30030ms
[28][child] Still alive for 30027ms
[26][child] Still alive for 30029ms
[30][child] Still alive for 30017ms
[49][child] Still alive for 30030ms
[13][child] Still alive for 60049ms
[12][child] Still alive for 60019ms
[20][child] Still alive for 60044ms
[37][child] Still alive for 60057ms
[39][child] Still alive for 60022ms
[14][child] Still alive for 60051ms
[30][child] Still alive for 60031ms
[26][child] Still alive for 60052ms
[28][child] Still alive for 60052ms
[49][child] Still alive for 60059ms
Terminated
[1][parent] Child process[56] exited with code: 143
[1][parent] Command sh exited. Killing dummy processes in 30s
[49][child] Received SIGTERM
[39][child] Received SIGTERM
[root@aws11-g09-ggr14 ~]# 
gitlab-runner logs
Mar 07 09:00:54 <host> gitlab-runner[25827]: WARNING: Appending trace to coordinator... job failed  code=403 job=30998224 job-log= job-status= runner=uLZXhVKj sent-log=3855-4214 status=403 Forbidden update-interval=0s
Mar 07 09:00:54 <host> gitlab-runner[25827]: Container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc finished with aborted  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:00:54 <host> gitlab-runner[25827]: Emitting SIGTERM to processes in container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:00:54 <host> gitlab-runner[25827]: Waiting for build to finish...                      error=canceled job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Disconnecting container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc from networks  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Can't get network list. ListNetworks exited with Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.39/networks": context canceled (docker.go:837:0s)  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc finished with error Delete "http://%2Fvar%2Frun%2Fdocker.sock/v1.39/containers/071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc?force=1&v=1": context canceled (docker.go:824:0s)  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: WARNING: after_script failed, but job will continue unaffected: context canceled  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Skipping referees execution                         job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: WARNING: Error while executing file based variables removal script  error=context canceled job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Disconnecting container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc from networks  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container dbae3877c1dafcf7bf7679dd2592e6c091749820378c29122570c28641817f44  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Disconnecting container dbae3877c1dafcf7bf7679dd2592e6c091749820378c29122570c28641817f44 from networks  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removed container dbae3877c1dafcf7bf7679dd2592e6c091749820378c29122570c28641817f44  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removed container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc  job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]:   duration_s=106.342801755 job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: WARNING: Appending trace to coordinator... job failed  code=403 job=30998224 job-log= job-status= runner=uLZXhVKj sent-log=3855-4570 status=403 Forbidden update-interval=0s
Mar 07 09:01:04 <host> gitlab-runner[25827]: Updating job...                                     bytesize=4571 checksum=crc32:c4a694a6 job=30998224 runner=uLZXhVKj
Mar 07 09:01:05 <host> gitlab-runner[25827]: WARNING: Submitting job to coordinator... job failed  bytesize=4571 checksum=crc32:c4a694a6 code=403 job=30998224 job-status= runner=uLZXhVKj status=PUT https://<gitlab host>/api/v4/jobs/30998224: 403 Forbidden update-interval=0s
Mar 07 09:01:05 <host> gitlab-runner[25827]: Removed job from processing list

Environment description

We are using self-hosted gitlab and gitlab-runner with docker executor

docker info
# docker info
Server Version: 18.09.9-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.101-91.76.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.503GiB
...
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Used GitLab Runner version

$ gitlab-runner --version
Version:      17.2.0
Git revision: 6428c288
Git branch:   17-2-stable
GO version:   go1.22.5
Built:        2024-07-18T14:00:52+0000
OS/Arch:      linux/amd64

Possible fixes

The problem as I see it is in dockerWaiter.retryWait(), specifically:

for ctx.Err() == nil { // <-- here
		err := d.wait(ctx, containerID, stopFn)

When job is canceled, ctx.Err() will not be nil (as log Container <id> finished with aborted suggests that is the case) and therefore the stopFn is never called, and d.client.ContainerStop never executed.

I'm not sure of the propper fix, but I think in case of canceled job ContainerStop should be executed at least once.

Edited by Armen Shakhbazian