Job stuck after run (VirtualBox SSH session kept open)
Summary
I’m setting up a new CI server and almost everything is smooth sailing.
Most jobs execute perfectly, but some get stuck after running their steps successfully and eventually fail with:
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
Steps to reproduce
I’m using the VirtualBox executor and did some debugging on a running VM. According to ps, nothing from the job is running on the machine anymore, but an idle SSH session is left open. If I kill that process, the job fails with an EOF.
The runner logs show the job as still running, but as far as I can tell nothing is actually busy. I don’t know how else to debug the link between gitlab-runner and the VirtualBox machine.
Although it fails randomly with any job, this one seems to fail over and over again:
deploy to staging:
  stage: deploy
  environment: staging
  script: |
    eval `ssh-agent -s`
    ssh-add -t 5m <(echo "$SSH_PRIVATE_KEY_STAGING")
    if [ "$(git rev-parse origin/master)" == "$CI_COMMIT_SHA" ] ; then
      bundle exec cap staging deploy deploy:cleanup
    else
      echo "We're not on the last commit anymore... skipping deploy"
    fi
  only:
    - master
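One way to narrow this down is to rule out the ssh-agent that the script starts. The variant below is a sketch under an unconfirmed assumption: that the agent, which keeps running in the background after the `eval`, is what holds the runner's SSH session open. It stops the agent with `ssh-agent -k` when the script's shell exits. The `trap` line is the only change to the job, and whether an EXIT trap fires reliably inside the script that gitlab-runner generates has not been verified here.

```yaml
deploy to staging:
  stage: deploy
  environment: staging
  script: |
    eval `ssh-agent -s`
    # Assumption (untested): the backgrounded agent keeps the session open.
    # `ssh-agent -k` kills the agent identified by $SSH_AGENT_PID.
    trap 'ssh-agent -k' EXIT
    ssh-add -t 5m <(echo "$SSH_PRIVATE_KEY_STAGING")
    if [ "$(git rev-parse origin/master)" == "$CI_COMMIT_SHA" ] ; then
      bundle exec cap staging deploy deploy:cleanup
    else
      echo "We're not on the last commit anymore... skipping deploy"
    fi
  only:
    - master
```

If the job still hangs with the agent stopped, the agent can be ruled out and the problem is more likely in how the runner handles the SSH session itself.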
Actual behavior
The code above is executed, but after that the job is stuck. When I SSH into the VirtualBox VM, ps auxwwf shows an open SSH session:
root 707 0.0 0.1 15720 8832 ? Ss 13:34 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 1630 0.0 0.1 19316 11136 ? Ss 13:34 0:00 \_ sshd: gitlab-runner [priv]
gitlab-+ 1633 0.0 0.1 19316 6472 ? S 13:34 0:00 | \_ sshd: gitlab-runner@notty
When I kill process 1633, the job fails with:
Saving cache for successful job
Cleaning up project directory and file based variables
ERROR: Job failed (system failure): EOF
If I don't kill it, the job eventually times out:
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 2h0m0s seconds
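While this is being debugged, the two-hour wait before the failure can be shortened with the job-level `timeout` keyword. This is only a workaround sketch and does not address the open session. The value of 15 minutes is an arbitrary assumption, not taken from this setup.

```yaml
deploy to staging:
  # Assumption: 15m is an illustrative value; choose something comfortably
  # above the job's normal duration.
  timeout: 15m
  # ...rest of the job unchanged (stage, environment, script, only)
```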
Expected behavior
I would expect gitlab-runner to close the SSH session and proceed with wrapping up the job.
Relevant logs and/or screenshots
I don't really know what to provide here. Feel free to ask for additional information.
Environment description
config.toml contents
concurrent = 4
check_interval = 0
user = "gitlab-runner"
shutdown_timeout = 0
# log_level = "debug"
# log_format = "json"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-ci-8....nl/..."
  url = "https://gitlab.....nl/"
  token = "..."
  executor = "virtualbox"
  [runners.cache]
    Type = "s3"
    Path = ".../cache"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "10.0.12.96:9005"
      AccessKey = "..."
      SecretKey = "..."
      BucketName = "runner"
      Insecure = true
  [runners.ssh]
    user = "gitlab-runner"
    identity_file = "/home/gitlab-runner/.ssh/id_ed25519"
    disable_strict_host_key_checking = true
    known_hosts_file = "/home/gitlab-runner/.ssh/known_hosts"
  [runners.virtualbox]
    base_name = "dev.....nl"
    base_folder = ""
    disable_snapshots = false
    start_type = "headless"
Used GitLab Runner version
Version: 17.3.1
Git revision: 66269445
Git branch: 17-3-stable
GO version: go1.22.5
Built: 2024-08-21T15:24:26+0000
OS/Arch: linux/amd64
Possible fixes
I don't have any yet.