Investigate (and hopefully improve detection/remediation of) stuck docker-machine processes

Issue

GitLab QA runners were unavailable due to stuck docker-machine create processes that had been running since August 12th, causing E2E pipeline failures.

Root Cause

Multiple docker-machine create processes became stuck during VM creation for the qa-runners infrastructure. The processes appeared to hang at the "pre-create checks" stage and never completed, preventing new runner VMs from being provisioned.

Key Details

  • 5+ stuck processes were identified, all running the same docker-machine create command with the Google Cloud driver.
  • The issue became noticeable on weekends, when autoscaling reduced machine counts and made the stuck processes more impactful.
  • The VMs themselves were created successfully in GCP, but the docker-machine processes never acknowledged completion.
  • GitLab Runner has no timeout protection for these processes; it assumes they will eventually return.

Resolution

@stanhu manually killed the stuck docker-machine create processes, which immediately restored normal runner functionality.
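The manual kill could be automated with a periodic cleanup job along these lines. This is only a sketch of the idea, not GitLab Runner's actual behavior; the one-day age threshold is an assumed value, not one taken from the incident.

```shell
#!/bin/sh
# Sketch: kill docker-machine create processes that have been running longer
# than MAX_AGE seconds. Threshold is an assumption, tune for your environment.
MAX_AGE=86400  # 1 day

for pid in $(pgrep -f 'docker-machine create' || true); do
  # etimes = elapsed seconds since the process started
  age=$(ps -o etimes= -p "$pid" | tr -d ' ')
  if [ "${age:-0}" -gt "$MAX_AGE" ]; then
    echo "killing stuck docker-machine create pid=$pid age=${age}s"
    kill "$pid"
  fi
done
```

Run from cron (or a systemd timer) on the runner manager host, this would have bounded the impact to one cleanup interval instead of several weeks.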

Follow-up Actions

  • Determine why docker-machine create processes get stuck (root cause analysis).
  • Evaluate whether the protection mechanisms can be improved to prevent this issue, likely including:
    • Adding timeouts to docker-machine create operations
    • Implementing process monitoring/cleanup for long-running creation tasks
    • Improving detection of stuck processes
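The timeout item could be prototyped without any Runner code changes by wrapping the create call in coreutils timeout(1). A sketch under assumptions: the 15-minute limit, the grace period, and the machine name are illustrative, not values from this incident.

```shell
# Sketch: bound docker-machine create with coreutils timeout(1).
# 15m limit, 30s TERM-to-KILL grace period, and machine name are assumptions.
timeout --kill-after=30s 15m \
  docker-machine create --driver google example-runner-vm
# timeout exits with status 124 when it had to kill the command for overrunning,
# which the caller can use to distinguish "hung" from "failed" creates.
```

A native fix inside GitLab Runner would be preferable long term, but a wrapper like this turns an indefinite hang into a detectable, retryable failure.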

Context from Slack

(Slack screenshot not included in this export)

Links

This ticket was created from INC-3321 and was automatically exported by incident.io 🔥

Edited by Gonzalo Servat