RHEL runner silent failure of jobs until timeout
Summary
When using RHEL 7.x runners, jobs fail silently if the runner encounters an availability issue (e.g. a sudden outage or a restart). The job stays in the running state until it hits the configured timeout.
Steps to reproduce
- Create a RHEL 7.x runner.
- Create a long-running pipeline (e.g. one that runs sleep).
- Run the pipeline.
- Once the job is running, reboot the runner server.
.gitlab-ci.yml
stages:          # List of stages for jobs, and their order of execution
  - build

build-job:       # This job runs in the build stage, which runs first.
  stage: build
  tags:
    - rhel9
  script:
    - echo "Compiling the code..."
    - sleep 600
    - echo "Compile complete."
Actual behavior
- The job remains in the running state and no failure message appears in the job log.
Expected behavior
- The job should fail properly with a message such as the following:
$ echo "Compiling the code..."
Compiling the code...
$ sleep 600
WARNING: after_script failed, but job will continue unaffected: context canceled
ERROR: Job failed (system failure): aborted: terminated
Environment description
config.toml contents
concurrent = 1
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "<REDACTED>"
  url = "<REDACTED>"
  id = 21
  token = "<REDACTED>"
  token_obtained_at = 2024-02-02T03:52:39Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.docker]
    tls_verify = false
    image = "ruby:2.7"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
    network_mtu = 0
Used GitLab Runner version
Version: 16.8.0
Git revision: c72a09b6
Git branch: 16-8-stable
GO version: go1.21.5
Built: 2024-01-18T22:42:25+0000
OS/Arch: linux/amd64
Possible fixes
- Using a more recent RHEL version (RHEL 9.X) seems to have addressed the issue.
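Until the runner host can be upgraded, a possible stopgap (not a fix, and not verified against this bug) is to set a job-level `timeout` so that a stuck job is at least bounded by a shorter limit. This relies on the observation above that the job stays in the running state only until the configured timeout is reached. The sketch below reuses the job from the reproduction steps; the `15 minutes` value is an assumption chosen only to sit a little above the 10-minute `sleep`, and a job-level timeout cannot exceed the maximum job timeout set on the runner.

```yaml
build-job:
  stage: build
  # Hypothetical limit: a little above the job's normal runtime (sleep 600 = 10 minutes),
  # so a job orphaned by a runner outage is failed sooner than the project default.
  timeout: 15 minutes
  tags:
    - rhel9
  script:
    - echo "Compiling the code..."
    - sleep 600
    - echo "Compile complete."
```

This only shortens how long the job stays in the running state. It does not make the job fail with the expected `system failure` message when the runner goes away.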