VirtualBox executor leaves a dangling VM when a job is cancelled
I'm running a self-hosted GitLab-CE (16.8.1, Omnibus) instance with self-hosted gitlab-runners.
One of my runner machines uses the VirtualBox executor to run isolated macOS builds (for which I cannot use docker).
As there are different VMs (with different macOS versions), I've registered multiple runners, but keep the concurrent level at 1 (as this is a somewhat old Mac mini).
concurrent = 1
check_interval = 0
[session_server]
session_timeout = 1800
[[runners]]
name = "vboxrunner1"
url = "https://git.example.com/"
token = "1234567890abcdef1234567890abcd"
executor = "virtualbox"
output_limit = 65536
[runners.cache]
[runners.ssh]
user = "admin"
password = "SECRET"
disable_strict_host_key_checking = true
[runners.virtualbox]
base_name = "box"
base_snapshot = "gitlab-ci"
base_folder = ""
disable_snapshots = true
[[runners]]
name = "vboxrunner2"
url = "https://git.example.com/"
token = "XXXXXXXXXXXXX"
executor = "virtualbox"
output_limit = 65536
[runners.custom_build_dir]
[runners.cache]
[runners.cache.s3]
[runners.cache.gcs]
[runners.cache.azure]
[runners.ssh]
user = "admin"
password = "SECRET"
disable_strict_host_key_checking = true
[runners.virtualbox]
base_name = "otherbox"
base_snapshot = "gitlab-ci"
base_folder = ""
disable_snapshots = true
The VMs (config, disk images, ...) live in /home/gitlab-runner/VMs/
Anyhow, this usually works fine.
When a new job is created on the runner (e.g. for vboxrunner1), a VM is cloned to /home/gitlab-runner/VMs/box-runner-XXXXXXXX-project-123-concurrent-0/ (so the name encodes the base VM image as well as the project ID).
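To make the naming scheme concrete, here is a minimal sketch that splits such a VM name into its parts. The pattern is inferred only from the example path in this report, not from the gitlab-runner source, and the names `VM_NAME_RE`, `JobVM` and `parse_job_vm_name` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumption: pattern inferred from the example directory name in this report,
# i.e. <base_name>-runner-<id>-project-<project id>-concurrent-<slot>.
VM_NAME_RE = re.compile(
    r"^(?P<base>.+)-runner-(?P<runner>[^-]+)"
    r"-project-(?P<project>\d+)-concurrent-(?P<slot>\d+)$"
)


class JobVM(NamedTuple):
    base_name: str        # base_name from [runners.virtualbox]
    runner_id: str        # the XXXXXXXX part of the name
    project_id: int       # GitLab project ID
    concurrent_slot: int  # always 0 here, since concurrent = 1


def parse_job_vm_name(name: str) -> Optional[JobVM]:
    """Return the parts of a per-job VM name, or None if it does not match."""
    m = VM_NAME_RE.match(name)
    if m is None:
        return None
    return JobVM(m["base"], m["runner"], int(m["project"]), int(m["slot"]))
```

Nothing in the name is unique per job, so a second job for the same project on the same runner ends up with exactly the same name and path.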
Usually, this VM is destroyed after the job finishes (and the folder is removed).
However, if a job is manually cancelled by a user (e.g. because a long-running pipeline is already known not to produce anything meaningful), the VM is not cleaned up correctly.
In particular, the file /home/gitlab-runner/VMs/box-runner-XXXXXXXX-project-123-concurrent-0/box-runner-XXXXXXXX-project-123-concurrent-0.vbox is left lingering around.
This effectively prevents any future jobs for the given project from using the same runner (as the generated VM name will be the same as that of the already existing file), and I get errors:
Creating new VM...
ERROR: Preparation failed: VBoxManageOutput error: VBoxManage: error: Machine settings file '/home/gitlab-runner/VMs/box-runner-XXXXXXXX-project-123-concurrent-0/box-runner-XXXXXXXX-project-123-concurrent-0.vbox' already exists
VBoxManage: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component MachineWrap, interface IMachine, callee nsISupports
VBoxManage: error: Context: "CreateMachine(bstrSettingsFile.raw(), Bstr(pszTrgName).raw(), ComSafeArrayAsInParam(groups), NULL, createFlags.raw(), NULL, NULL, NULL, trgMachine.asOutParam())" at line 704 of file VBoxManageMisc.cpp
Will be retried in 3s ...
The error only goes away if I manually log into the host and delete the dangling directory.
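The manual cleanup described above could be scripted roughly as follows. This is only a sketch of a workaround, not a fix: it assumes the directory naming seen in this report, it assumes it is run while no job is active on the host (otherwise it would also match the VM of a running job), and the function names are hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import Iterable, List

# Assumption: per-job clone directories follow the naming seen in this report.
JOB_DIR_RE = re.compile(r"^.+-runner-[^-]+-project-\d+-concurrent-\d+$")


def find_leftover_job_dirs(vm_root: Path) -> List[Path]:
    """List per-job VM directories that still contain their .vbox settings file."""
    leftovers = []
    for entry in sorted(vm_root.iterdir()):
        if not entry.is_dir() or not JOB_DIR_RE.match(entry.name):
            continue  # base VMs such as "box" or "otherbox" are skipped
        if (entry / f"{entry.name}.vbox").is_file():
            leftovers.append(entry)
    return leftovers


def remove_leftovers(dirs: Iterable[Path], dry_run: bool = True) -> None:
    """Delete the given directories; with dry_run=True only report them."""
    for d in dirs:
        if dry_run:
            print(f"would remove {d}")
        else:
            shutil.rmtree(d)
            print(f"removed {d}")
```

For example, `remove_leftovers(find_leftover_job_dirs(Path("/home/gitlab-runner/VMs")))` only prints what it would delete; pass `dry_run=False` to actually remove the directories. Whether the dangling VM is also still registered with VirtualBox is not something I have checked; if it is, it may additionally need to be unregistered.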
This is somewhat similar to #30850 (closed), but I've decided to create a separate issue, as in my case there is no need to trigger any unexpected event (like OOM, killing gitlab-runner, or rebooting the machine). Instead, a perfectly valid user action ("cancelling a job") exhibits the problem.
Note that I'm not absolutely sure whether this is always triggered when the user cancels a job, or only at some particular stage (e.g. while the step_script is executed, or while the VM is being set up, or ...).