Docker+machine: add shutdown drain for idle machines
Problem
When the GitLab Runner process shuts down, idle machines in the docker+machine pool are left running. Cleanup currently requires external tooling such as a systemd ExecStopPost hook that runs docker-machine rm with limited concurrency.
Solution
Add built-in, opt-in support for draining idle machines during runner shutdown. When enabled, the executor provider will remove idle machines from the pool when Shutdown() is called (after all jobs have completed, before the process exits).
This moves the drain logic into the runner itself, eliminating the need for external systemd hooks.
Note: The drain still happens after jobs complete, not in parallel. Draining while jobs are running would require splitting the "stop accepting new jobs" and "wait for current jobs" phases, which is a larger refactor. This MR focuses on providing a built-in alternative to external drain scripts.
Configuration
New global TOML section [machine] at the root level (not per-runner, since the docker+machine provider is a singleton shared across all runners):
concurrent = 10
check_interval = 0
[machine]
[machine.shutdown_drain]
enabled = true # opt-in, default: false
concurrency = 5 # parallel removals, default: 3
timeout = "10m" # drain timeout, default: uses global shutdown_timeout
max_retries = 3 # retries per machine, default: 3
retry_backoff = "5s" # base backoff (multiplied by attempt), default: 5s
[[runners]]
name = "my-runner"
executor = "docker+machine"
[runners.machine]
# ... per-runner machine config ...
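The documented defaults could be applied roughly as in the following Go sketch. The type name ShutdownDrainConfig, its fields, and the WithDefaults helper are hypothetical names chosen for illustration, not necessarily the types this MR adds; the sketch also assumes that a zero value means "not set".

```go
package main

import "time"

// ShutdownDrainConfig mirrors the [machine.shutdown_drain] keys shown above.
// The struct and field names are illustrative assumptions.
type ShutdownDrainConfig struct {
	Enabled      bool          // enabled, default: false (opt-in)
	Concurrency  int           // concurrency, default: 3
	Timeout      time.Duration // timeout, default: global shutdown_timeout
	MaxRetries   int           // max_retries, default: 3
	RetryBackoff time.Duration // retry_backoff, default: 5s
}

// WithDefaults returns a copy in which every unset (zero or negative) value
// is replaced by its documented default. Treating zero as "unset" is a
// simplification: an explicit max_retries = 0 cannot be expressed this way.
func (c ShutdownDrainConfig) WithDefaults(globalShutdownTimeout time.Duration) ShutdownDrainConfig {
	if c.Concurrency <= 0 {
		c.Concurrency = 3
	}
	if c.Timeout <= 0 {
		c.Timeout = globalShutdownTimeout
	}
	if c.MaxRetries <= 0 {
		c.MaxRetries = 3
	}
	if c.RetryBackoff <= 0 {
		c.RetryBackoff = 5 * time.Second
	}
	return c
}
```

With the example configuration above, only concurrency and timeout differ from the defaults; a config that sets nothing but enabled = true would drain with 3 parallel removals, 3 retries, and a 5s base backoff.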
Implementation
- Adds MachineConfig struct to global Config with ShutdownDrain settings
- Changes ManagedExecutorProvider.Shutdown(ctx) interface to Shutdown(ctx, config) to pass global config
- Adds ForceRemove() to Machine interface using docker-machine rm -f (faster than Stop + Remove)
- On shutdown, collects all idle machines and removes them concurrently up to the configured limit
- Failed removals are retried up to max_retries times, with the wait growing on each attempt (retry_backoff multiplied by the attempt number)
- Respects context cancellation and timeout
- Non-idle machines (in use, creating, etc.) are skipped
Why global config?
The docker+machine executor provider is a singleton: one instance is shared across all [[runners]] sections. Per-runner config would require merging potentially conflicting settings, and the result would be hard to predict. A single global [machine] section makes the behavior clear and predictable.