Gitlab-Runner with docker executor does not stop the container if job was canceled.
Summary
Gitlab-Runner with docker executor does not stop the container if job was canceled. Instead, it attempts to execute sh script to send SIGTERM to processes and then immediately force removes the container when the script times out.
Steps to reproduce
Here is a simple script to reproduce the issue. It is an entrypoint for a Dockerfile that will spawn 10 processes along with the command. This is to make problem more visible - with 10s timeout graceGracefulExitFunc will be stuck trying to kill processes and will not deliver SIGTERM to all of them within timeout.
Dockerfile
FROM node:22.13-alpine3.21
WORKDIR /etc/app
COPY entrypoint.js /usr/local/bin/entrypoint.js
ENTRYPOINT ["node", "/usr/local/bin/entrypoint.js"]
entrypoint.js
const isParent = !process.env.IS_CHILD;
function log(message) {
process.stdout.write(`[${process.pid}][${isParent ? 'parent' : 'child'}] ${message}\n`);
}
function logError(message) {
process.stderr.write(`[${process.pid}][${isParent ? 'parent' : 'child'}] ${message}\n`);
}
function waitForChild(child) {
return new Promise((resolve) => {
child.on('exit', (code, signal) => {
if (signal) {
log(`Child process[${child.pid}] was terminated by signal: ${signal}`);
} else {
log(`Child process[${child.pid}] exited with code: ${code}`);
}
resolve();
});
child.on('error', (err) => {
logError(`Child process[${child.pid}] errored: ${err}`);
resolve();
});
});
};
if (isParent) {
const childProcesses = [];
for (let i = 0; i < 10; i++) {
const child = child_process.fork(__filename, {env: {IS_CHILD: 'true'}, stdio: ['ignore', 'inherit', 'inherit', 'ipc']});
childProcesses.push(child);
}
log('Forked 10 children');
Promise.all(childProcesses.map(waitForChild)).then(() => log('All children exited'));
const rawArgs = process.argv.slice(2);
const command = rawArgs[0];
const args = rawArgs.slice(1);
if (command) {
log(`Starting command: ${command} ${args.join(' ')}`);
const commandProcess = child_process.spawn(command, args, {
stdio: 'inherit',
env: process.env
});
waitForChild(commandProcess).then(() => {
log(`Command ${command} exited. Killing dummy processes in 30s`);
setTimeout(() => {
childProcesses.map((child) => child.kill('SIGKILL'));
}, 30000);
});
process.on('SIGTERM', () => {
childProcesses.map((child) => child.kill('SIGTERM'));
});
} else {
log('No command provided');
}
} else {
log('I am a child');
const startedAt = Date.now();
setInterval(() => log(`Still alive for ${Date.now() - startedAt}ms`), 30000);
}
process.on('SIGTERM', () => log('Received SIGTERM'));
process.on('SIGINT', () => log('Received SIGINT'));
.gitlab-ci.yml
- test
test:
image: gitlab-runner-test:v0.0.1 #image built from Dockerfile and entrypoint.js above
stage: test
tags:
- docker, ansible, dev
allow_failure: true
script:
- >
- echo 'test'
- sleep 360
- echo 'done'
Actual behavior
When job is cancelled, gitlab runner executes graceGracefulExitFunc sending SIGTERM to processes in container. However, it has timeout of 10 seconds and does not always terminate all the processes within that timeout.
Then container is force removed then, without any attempt to execute stop command on it.
(See attached gitlab-runner and container logs.)
Therefore entrypoint with pid 1 does not receive SIGTERM and has no chance to do proper cleanup. As well as any parent process if it spawns enough child processes.
Expected behavior
After attempting to execute graceGracefulExitFunc to kill child processes and timing out, gitlab-runner always attempts to execute stop on container to give a chance of proper cleanup.
Relevant logs and/or screenshots
job log from container
[13][child] I am a child
[12][child] I am a child
[20][child] I am a child
[37][child] I am a child
[14][child] I am a child
[39][child] I am a child
[26][child] I am a child
[28][child] I am a child
[30][child] I am a child
[49][child] I am a child
[1][parent] Forked 10 children
[1][parent] Starting command: sh -c if [ -x /usr/local/bin/bash ]; then
exec /usr/local/bin/bash
elif [ -x /usr/bin/bash ]; then
exec /usr/bin/bash
elif [ -x /bin/bash ]; then
exec /bin/bash
elif [ -x /usr/local/bin/sh ]; then
exec /usr/local/bin/sh
elif [ -x /usr/bin/sh ]; then
exec /usr/bin/sh
elif [ -x /bin/sh ]; then
exec /bin/sh
elif [ -x /busybox/sh ]; then
exec /busybox/sh
else
echo shell not found
exit 1
fi
$ echo 'test'
test
$ sleep 360
[13][child] Still alive for 30030ms
[12][child] Still alive for 30016ms
[20][child] Still alive for 30030ms
[37][child] Still alive for 30029ms
[39][child] Still alive for 30003ms
[14][child] Still alive for 30030ms
[28][child] Still alive for 30027ms
[26][child] Still alive for 30029ms
[30][child] Still alive for 30017ms
[49][child] Still alive for 30030ms
[13][child] Still alive for 60049ms
[12][child] Still alive for 60019ms
[20][child] Still alive for 60044ms
[37][child] Still alive for 60057ms
[39][child] Still alive for 60022ms
[14][child] Still alive for 60051ms
[30][child] Still alive for 60031ms
[26][child] Still alive for 60052ms
[28][child] Still alive for 60052ms
[49][child] Still alive for 60059ms
Terminated
[1][parent] Child process[56] exited with code: 143
[1][parent] Command sh exited. Killing dummy processes in 30s
[49][child] Received SIGTERM
[39][child] Received SIGTERM
[root@aws11-g09-ggr14 ~]#
gitlab-runner logs
Mar 07 09:00:54 <host> gitlab-runner[25827]: WARNING: Appending trace to coordinator... job failed code=403 job=30998224 job-log= job-status= runner=uLZXhVKj sent-log=3855-4214 status=403 Forbidden update-interval=0s
Mar 07 09:00:54 <host> gitlab-runner[25827]: Container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc finished with aborted job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:00:54 <host> gitlab-runner[25827]: Emitting SIGTERM to processes in container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:00:54 <host> gitlab-runner[25827]: Waiting for build to finish... error=canceled job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Disconnecting container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc from networks job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Can't get network list. ListNetworks exited with Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.39/networks": context canceled (docker.go:837:0s) job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc finished with error Delete "http://%2Fvar%2Frun%2Fdocker.sock/v1.39/containers/071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc?force=1&v=1": context canceled (docker.go:824:0s) job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: WARNING: after_script failed, but job will continue unaffected: context canceled job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Skipping referees execution job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: WARNING: Error while executing file based variables removal script error=context canceled job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Disconnecting container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc from networks job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removing container dbae3877c1dafcf7bf7679dd2592e6c091749820378c29122570c28641817f44 job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Disconnecting container dbae3877c1dafcf7bf7679dd2592e6c091749820378c29122570c28641817f44 from networks job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removed container dbae3877c1dafcf7bf7679dd2592e6c091749820378c29122570c28641817f44 job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: Removed container 071a53300a84895e064aecc775b95a0837bc0fb912655632f4a6667f564f2acc job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: duration_s=106.342801755 job=30998224 project=28069 runner=uLZXhVKj
Mar 07 09:01:04 <host> gitlab-runner[25827]: WARNING: Appending trace to coordinator... job failed code=403 job=30998224 job-log= job-status= runner=uLZXhVKj sent-log=3855-4570 status=403 Forbidden update-interval=0s
Mar 07 09:01:04 <host> gitlab-runner[25827]: Updating job... bytesize=4571 checksum=crc32:c4a694a6 job=30998224 runner=uLZXhVKj
Mar 07 09:01:05 <host> gitlab-runner[25827]: WARNING: Submitting job to coordinator... job failed bytesize=4571 checksum=crc32:c4a694a6 code=403 job=30998224 job-status= runner=uLZXhVKj status=PUT https://<gitlab host>/api/v4/jobs/30998224: 403 Forbidden update-interval=0s
Mar 07 09:01:05 <host> gitlab-runner[25827]: Removed job from processing list
Environment description
We are using self-hosted gitlab and gitlab-runner with docker executor
docker info
# docker info
Server Version: 18.09.9-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.101-91.76.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.503GiB
...
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Used GitLab Runner version
$ gitlab-runner --version
Version: 17.2.0
Git revision: 6428c288
Git branch: 17-2-stable
GO version: go1.22.5
Built: 2024-07-18T14:00:52+0000
OS/Arch: linux/amd64
Possible fixes
The problem as I see it is in dockerWaiter.retryWait(), specifically:
for ctx.Err() == nil { // <-- here
err := d.wait(ctx, containerID, stopFn)
When job is canceled, ctx.Err() will not be nil (as log Container <id> finished with aborted suggests that is the case) and therefore the stopFn is never called, and d.client.ContainerStop never executed.
I'm not sure of the propper fix, but I think in case of canceled job ContainerStop should be executed at least once.