Docker+machine: add shutdown drain for idle machines (!6330) · Merge requests · GitLab.org / gitlab-runner

Problem

When the GitLab Runner process shuts down, idle machines in the docker+machine pool are left running. Cleanup currently requires external tooling such as a systemd ExecStopPost hook that runs docker-machine rm with limited concurrency.

Solution

Add built-in, opt-in support for draining idle machines during runner shutdown. When enabled, the executor provider will remove idle machines from the pool when Shutdown() is called (after all jobs have completed, before the process exits).

This moves the drain logic into the runner itself, eliminating the need for external systemd hooks.

Note: The drain still happens after jobs complete, not in parallel. Draining while jobs are running would require splitting the "stop accepting new jobs" and "wait for current jobs" phases, which is a larger refactor. This MR focuses on providing a built-in alternative to external drain scripts.

Configuration

New global TOML section [machine] at the root level (not per-runner, since the docker+machine provider is a singleton shared across all runners):

concurrent = 10
check_interval = 0

[machine]
  [machine.shutdown_drain]
    enabled = true        # opt-in, default: false
    concurrency = 5       # parallel removals, default: 3  
    timeout = "10m"       # drain timeout, default: uses global shutdown_timeout
    max_retries = 3       # retries per machine, default: 3
    retry_backoff = "5s"  # base backoff (multiplied by attempt), default: 5s

[[runners]]
  name = "my-runner"
  executor = "docker+machine"
  [runners.machine]
    # ... per-runner machine config ...

Implementation

Adds MachineConfig struct to global Config with ShutdownDrain settings
Changes ManagedExecutorProvider.Shutdown(ctx) interface to Shutdown(ctx, config) to pass global config
Adds ForceRemove() to Machine interface using docker-machine rm -f (faster than Stop + Remove)
On shutdown, collects all idle machines and removes them concurrently up to the configured limit
Failed removals are retried with increasing backoff (retry_backoff multiplied by the attempt number)
Respects context cancellation and timeout
Non-idle machines (in use, creating, etc.) are skipped

Why global config?

The docker+machine executor provider is a singleton - one instance shared across all [[runners]] sections. Per-runner config would require merging potentially conflicting settings, which is confusing. A single global [machine] section makes the behavior clear and predictable.

References

gitlab-com/gl-infra/production-engineering#28168

Edited Jan 29, 2026 by Igor
