Shell runner doesn't kill hanging process
Summary
Using a GitLab runner with the shell executor, we have noticed that some jobs leave a process running on the machine even after the job times out, i.e. the runner doesn't kill the process it started.
This happens with a misbehaving process that hangs on its own, but I would still expect the runner to kill it once the job times out. If we log in to the machine and manually send SIGKILL to the process, it stops immediately.
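The manual cleanup described above could be scripted roughly as follows. This is only a sketch under assumptions: it treats any process owned by the runner's user whose parent is PID 1 as a leftover candidate (which can also match legitimate daemons), and the names `orphaned_trees` and `kill_orphans` are hypothetical helpers, not part of GitLab Runner.

```python
import os
import signal
import subprocess
from collections import defaultdict


def orphaned_trees(ps_text, uid):
    """Return PIDs owned by `uid` whose parent is PID 1, plus all their descendants.

    `ps_text` holds one process per line as "PID PPID UID ARGS".
    """
    children = defaultdict(list)
    roots = []
    for line in ps_text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pid, ppid, owner = int(fields[0]), int(fields[1]), int(fields[2])
        children[ppid].append(pid)
        if ppid == 1 and owner == uid:
            roots.append(pid)
    # Walk each reparented process and collect its whole subtree.
    found, stack = [], list(reversed(roots))
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(reversed(children[pid]))
    return found


def kill_orphans(uid, sig=signal.SIGKILL):
    """Signal every candidate in a live `ps` snapshot. Review the list first."""
    ps_text = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "uid=", "-o", "args="],
        check=True, capture_output=True, text=True,
    ).stdout
    for pid in orphaned_trees(ps_text, uid):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # already gone
```

`kill_orphans` has to run as root or as the runner's user, and it is safer to print the result of `orphaned_trees` and inspect it before signalling anything.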
Steps to reproduce
Here is one scenario that should let others reproduce it. On our side, this scenario reproduces the problem every time, and we have also identified other similar scenarios with the same behavior.
Consider the following project: https://github.com/gaeljw/sbt-test-hangs.
.gitlab-ci.yml
stages:
  - test
test_java11:
  stage: test
  tags:
    - sbt
    - java11
  script:
    - sbt -java-home /usr/lib/jvm/java-11 clean test
Actual behavior
Because of the underlying bug in sbt (https://github.com/sbt/sbt/issues/7429), the sbt -java-home /usr/lib/jvm/java-11 clean test process (java under the hood) hangs forever doing nothing.
The GitLab CI job times out after 1 hour but leaves the process running on the runner machine.
Expected behavior
I would expect the runner to kill the sbt process.
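To make the expectation concrete, here is a minimal Python sketch of timeout handling that signals the whole process group rather than a single PID. It is an illustration under assumptions, not GitLab Runner's implementation: `run_with_timeout`, `grace_s` and the return convention are hypothetical names chosen for this example.

```python
import os
import signal
import subprocess


def _signal_group(pgid, sig):
    """Send `sig` to every process in group `pgid`, ignoring an empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # every member has already exited


def run_with_timeout(cmd, timeout_s, grace_s=5.0):
    """Run `cmd` in its own process group and kill the whole group on timeout.

    Returns the command's exit code, or None if it had to be killed.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its PID doubles as the process group ID of all its descendants.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass

    # Ask politely first, then give the group a short grace period.
    _signal_group(proc.pid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass

    # Sweep up anything in the group that ignored SIGTERM.
    _signal_group(proc.pid, signal.SIGKILL)
    proc.wait()
    return None
```

For example, `run_with_timeout(["sbt", "clean", "test"], 3300)` would give the build 55 minutes and then terminate sbt together with the JVM it started, provided neither moved itself into a different process group.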
Relevant logs and/or screenshots
job log
Running with gitlab-runner 16.4.1 (d89a789a)
on gitlab-runner-prod-01-1 yf3r6ahV, system ID: s_fa4d46773d5e
Preparing the "shell" executor 00:00
Using Shell (bash) executor...
Preparing environment 00:00
Running on gitlab-runner01.mycompany.net...
Getting source from Git repository 00:01
Fetching changes...
Reinitialized existing Git repository in /opt/gitlab-runner/builds/yf3r6ahV/3/mygroup/myproject/.git/
Checking out 62cb9b5f as detached HEAD (ref is refs/merge-requests/100/head)...
Removing project/project/
Removing project/target/
Removing target/
Skipping Git submodules setup
Executing "step_script" stage of the job script
$ sbt -java-home /usr/lib/jvm/java-11 clean test
[info] welcome to sbt 1.9.7 (Red Hat, Inc. Java 11.0.18)
...
Session terminated, killing shell... ...killed.
WARNING: Timed out waiting for the build to finish
ERROR: Job failed: execution took longer than 1h0m0s seconds
Process list
Captured after a few minutes of the process hanging:
root 1296228 1 0 Nov15 ? 00:01:31 /usr/bin/gitlab-runner run --working-directory /opt/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitlab-runner --user gitlab-runner
root 1409674 1296228 0 08:01 ? 00:00:00 su -s /bin/bash gitlab-runner -c bash -l
gitlab-+ 1409675 1409674 0 08:01 ? 00:00:00 bash -l
gitlab-+ 1409709 1409675 0 08:01 ? 00:00:00 bash -l
gitlab-+ 1409712 1409709 99 08:01 ? 00:03:43 /usr/lib/jvm/java-11/bin/java -Dfile.encoding=UTF-8 -Dsbt.override.build.repos=true -Xms1024m -Xmx1024m -Xss4M -XX:ReservedCodeCacheSize=128m -Dsbt.script=/usr/bin/sbt -Dscala
Captured after killing the job in the GitLab UI (similar to what can be observed after the 1h timeout; I will post the exact output after a job timeout):
gitlab-+ 1409709 1 0 08:01 ? 00:00:00 bash -l
gitlab-+ 1409712 1409709 93 08:01 ? 00:03:44 /usr/lib/jvm/java-11/bin/java -Dfile.encoding=UTF-8 -Dsbt.override.build.repos=true -Xms1024m -Xmx1024m -Xss4M -XX:ReservedCodeCacheSize=128m -Dsbt.script=/usr/bin/sbt -Dscala.ext.d
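In the second listing the inner bash has been reparented to PID 1 and still has the java process as its child. One plausible reading, not confirmed for the runner, is that only part of the process tree received the kill signal. The sketch below illustrates that general POSIX behavior in Python: signalling one PID leaves its children running, while signalling the process group does not. The function name `grandchild_survives` and the use of `sleep` as a stand-in for the hung JVM are assumptions made for this example.

```python
import os
import select
import signal
import subprocess


def grandchild_survives(kill_group, settle_s=2.0):
    """Kill a shell that has a background child; report whether the child outlives it.

    kill_group=False signals only the shell's PID.
    kill_group=True signals the shell's whole process group.
    """
    # The shell becomes leader of a new session and process group, so its
    # background `sleep` (a stand-in for the hung JVM) shares that group.
    shell = subprocess.Popen(
        ["sh", "-c", "sleep 300 & echo started; wait"],
        stdout=subprocess.PIPE,
        bufsize=0,
        start_new_session=True,
    )
    shell.stdout.readline()  # returns once the background child has been forked

    if kill_group:
        os.killpg(shell.pid, signal.SIGKILL)
    else:
        os.kill(shell.pid, signal.SIGKILL)
    shell.wait()

    # The child inherited the pipe's write end, so the pipe only reaches EOF
    # (and becomes readable) once every process holding it has exited.
    readable, _, _ = select.select([shell.stdout], [], [], settle_s)
    survived = not readable

    if survived:
        os.killpg(shell.pid, signal.SIGKILL)  # clean up the leftover child
    shell.stdout.close()
    return survived


if __name__ == "__main__":
    print("signal shell only  -> child survives:", grandchild_survives(False))
    print("signal whole group -> child survives:", grandchild_survives(True))
```

On a typical Linux host the first call is expected to report that the child survives and the second that it does not.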
Environment description
We are using an on-premises GitLab instance with on-premises shared runners.
$ sbt --version
sbt version in this project: 1.9.4
sbt script version: 1.9.4
$ java --version
openjdk 11.0.18 2023-01-17 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.18.0.10-3.el9) (build 11.0.18+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.18.0.10-3.el9) (build 11.0.18+10-LTS, mixed mode, sharing)
$ cat /etc/centos-release
CentOS Stream release 9
$ uname -a
Linux gitlab-runner01.mycompany.net 5.14.0-307.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Wed May 3 06:16:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
config.toml contents
listen_address = "gitlab-runner01.mycompany.net:10986"
concurrent = 8
check_interval = 0
shutdown_timeout = 0
[session_server]
session_timeout = 1800
[[runners]]
name = "gitlab-runner-prod-01-1"
limit = 8
url = "http://gitlab.mycompany.net/"
id = 408
token = "xxx"
token_obtained_at = 2023-11-15T10:40:24Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "shell"
Used GitLab Runner version
$ gitlab-runner -v
Version: 16.4.1
Git revision: d89a789a
Git branch: 16-4-stable
GO version: go1.20.5
Built: 2023-10-06T01:26:25+0000
OS/Arch: linux/amd64
Possible fixes
N/A
Other notes
This could be a duplicate of #2424, which was closed due to inactivity.
We have only ever observed this behavior with jobs running sbt processes, but these also make up 90% of our jobs at the moment, so we are not sure whether sbt is the culprit or it's just a coincidence.