Job stuck after run (VirtualBox SSH session kept open)

Summary

I’m setting up a new CI server and almost everything is smooth sailing.

Most jobs run fine, but some get stuck after all steps have completed successfully and eventually fail with:

WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts

Steps to reproduce

I’m using the VirtualBox executor and did some debugging on a running VM. Inside the VM, ps shows no job processes running anymore, but an idle SSH session is left open. If I kill that process, the job fails with an EOF.

The runner log shows the job as still running, but as far as I can tell nothing is actually busy. I don’t know how else to debug the link between gitlab-runner and the VirtualBox VM.
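A small filter along these lines could help check whether anything owned by the runner user survives outside the sshd tree. It is only a sketch: `leftover_procs` is a hypothetical helper name, and it assumes the VM has a procps-style `ps` that accepts `-eo user:32,pid,ppid,stat,comm` (the wider user column avoids the truncated `gitlab-+` seen in `ps aux`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ps -eo user:32,pid,ppid,stat,comm` output on stdin
# and prints "PID COMMAND" for every process owned by the given user whose
# command is not sshd. The first input line is assumed to be the ps header.
leftover_procs() {
  local user="$1"
  awk -v u="$user" 'NR > 1 && $1 == u && $5 != "sshd" { print $2, $5 }'
}

# Intended use inside the VM while a job is hanging:
#   ps -eo user:32,pid,ppid,stat,comm | leftover_procs gitlab-runner
```

An empty result would suggest that only the sshd session itself is left. Any line that does show up is a candidate for whatever is keeping the session open.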

Although it happens randomly with any job, this one seems to fail over and over again:

deploy to staging:
  stage: deploy
  environment: staging
  script: |
    eval `ssh-agent -s`
    ssh-add -t 5m <(echo "$SSH_PRIVATE_KEY_STAGING")

    if [ "$(git rev-parse origin/master)" == "$CI_COMMIT_SHA" ] ; then
      bundle exec cap staging deploy deploy:cleanup
    else
      echo "We're not on the last commit anymore... skipping deploy"
    fi
  only:
    - master

Actual behavior

The script above runs to completion, but then the job is stuck. When I SSH into the VirtualBox VM, ps auxwwf shows an open SSH session:

root         707  0.0  0.1  15720  8832 ?        Ss   13:34   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root        1630  0.0  0.1  19316 11136 ?        Ss   13:34   0:00  \_ sshd: gitlab-runner [priv]
gitlab-+    1633  0.0  0.1  19316  6472 ?        S    13:34   0:00  |   \_ sshd: gitlab-runner@notty

When I kill 1633, the job fails with:

Saving cache for successful job
Cleaning up project directory and file based variables
ERROR: Job failed (system failure): EOF

If I don't kill it:

WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 2h0m0s seconds

Expected behavior

I would expect gitlab-runner to close the SSH session and proceed with wrapping up the job.

Relevant logs and/or screenshots

I don't really know what to provide here. Feel free to ask for additional information.

Environment description

config.toml contents

concurrent = 4
check_interval = 0
user = "gitlab-runner"
shutdown_timeout = 0
# log_level = "debug"
# log_format = "json"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-ci-8....nl/..."
  url = "https://gitlab.....nl/"
  token = "..."

  executor = "virtualbox"
  [runners.cache]
    Type = "s3"
    Path = ".../cache"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "10.0.12.96:9005"
      AccessKey = "..."
      SecretKey = "..."
      BucketName = "runner"
      Insecure = true
  [runners.ssh]
    user = "gitlab-runner"
    identity_file = "/home/gitlab-runner/.ssh/id_ed25519"
    disable_strict_host_key_checking = true
    known_hosts_file = "/home/gitlab-runner/.ssh/known_hosts"
  [runners.virtualbox]
    base_name = "dev.....nl"
    base_folder = ""
    disable_snapshots = false
    start_type = "headless"

Used GitLab Runner version

Version:      17.3.1
Git revision: 66269445
Git branch:   17-3-stable
GO version:   go1.22.5
Built:        2024-08-21T15:24:26+0000
OS/Arch:      linux/amd64

Possible fixes

I don't have any yet.
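One untested idea, and purely an assumption: the job that fails most often starts an `ssh-agent` and never stops it, so ruling out that leftover agent seems cheap. Nothing above confirms it is the cause. The sketch below uses a hypothetical helper, `stop_agent`, that relies only on the `SSH_AGENT_PID` and `SSH_AUTH_SOCK` variables that `ssh-agent -s` exports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop the ssh-agent started earlier in this shell, if any.
# Assumes the agent was started with: eval `ssh-agent -s`
stop_agent() {
  # Only send a signal when the recorded PID still refers to a live process.
  if [ -n "${SSH_AGENT_PID:-}" ] && kill -0 "$SSH_AGENT_PID" 2>/dev/null; then
    kill "$SSH_AGENT_PID"
  fi
  # Clear the variables so later commands do not try to reach a dead agent.
  unset SSH_AGENT_PID SSH_AUTH_SOCK
}

# In the job script this would be called as the last step, after the deploy:
#   stop_agent
```

If the job still hangs with the agent gone, the agent can be ruled out and the leftover SSH session must have another cause.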