Commit 85caaa16 authored by Igor's avatar Igor

Runner Managers on Kubernetes: Note docker-machine cleanup behaviour, add Runway to considered alternatives
parent 173f8be4
@@ -319,6 +319,36 @@ The current VM-based deployment uses a 2-hour systemd timeout (`TimeoutStopSec=2

**ArgoCD and long-running deploys:** ArgoCD does not block new syncs while pods are terminating — rapid successive deploys may cause resource pressure (multiple pod generations running). Renovate's MR-based workflow provides a natural gate.

### Docker Machine VM Cleanup on Shutdown

The current VM-based deployment includes a post-shutdown cleanup mechanism that is **not part of gitlab-runner itself**. It's implemented as a systemd `ExecStopPost` script that removes stale Docker Machine VMs after the runner manager stops:

```bash
#!/bin/bash
set -eo pipefail
parallel=${1:-1}
export MACHINE_STORAGE_PATH=${MACHINE_STORAGE_PATH:-/root/.docker/machine}
# -r: skip docker-machine entirely when the machines directory is empty,
# so the script doesn't fail under `set -e` on an already-clean host
ls "${MACHINE_STORAGE_PATH}/machines/" | xargs -r -n 1 -P "${parallel}" docker-machine rm -f
```

This script only starts **after** the gitlab-runner process has finished processing all of its assigned jobs, which can take anywhere from a few minutes to a few hours. It runs with a concurrency of 3, which is slow for high-capacity shards (e.g., removing 600 idle VMs on small-amd64 could take 20-30 minutes). When implementing the entrypoint wrapper, increase the concurrency to reduce cleanup time.

**Kubernetes limitation:** Unlike systemd's `ExecStopPost`, Kubernetes has no "post-mortem" hook. The `preStop` hook runs *before* SIGQUIT is sent, not after. Once the container terminates (gracefully or via SIGKILL after grace period), there's no built-in mechanism to run cleanup commands.

**Kubernetes implementation options:**

1. **Build into docker+machine executor.** Implement cleanup logic directly in the docker+machine executor's shutdown sequence ([gitlab-runner!6330](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/6330)). This is the cleanest solution: the executor knows which machines are in-use vs idle, allowing it to delete idle machines early rather than waiting for full job drain. No external wrapper needed.

2. **Entrypoint wrapper.** Wrap gitlab-runner with a script that traps the shutdown signal, forwards it to gitlab-runner, waits for it to exit, then runs cleanup. This mimics systemd's `ExecStopPost` behavior. Constraint: cleanup must complete within `terminationGracePeriodSeconds` (shared with job drain time).
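Option 2 can be sketched as a small shell wrapper. This is illustrative only: the function name, `CLEANUP_PARALLEL` variable, and the `sleep` demo default are assumptions, not part of gitlab-runner; a real wrapper would receive gitlab-runner's arguments and exit with its status.

```shell
#!/usr/bin/env bash
# Sketch of an ExecStopPost-style entrypoint wrapper (all names assumed).
set -uo pipefail

cleanup_machines() {
  # Same removal logic as the systemd ExecStopPost script, with a higher
  # default concurrency to shorten cleanup on large shards.
  local parallel=${CLEANUP_PARALLEL:-16}
  local store=${MACHINE_STORAGE_PATH:-/root/.docker/machine}
  ls "${store}/machines/" 2>/dev/null \
    | xargs -r -n 1 -P "${parallel}" docker-machine rm -f
}

cmd=("$@")
# Demo default so the sketch runs standalone; the real wrapper would start
# gitlab-runner with its usual arguments here.
[ ${#cmd[@]} -eq 0 ] && cmd=(sleep 1)

"${cmd[@]}" &
child=$!

# Mimic systemd: forward shutdown signals to the wrapped process.
trap 'kill -TERM "${child}" 2>/dev/null' TERM
trap 'kill -QUIT "${child}" 2>/dev/null' QUIT

# Wait until the wrapped process has drained and exited; `wait` can return
# early when a trap fires, so loop while the child is still alive.
status=0
while kill -0 "${child}" 2>/dev/null; do
  wait "${child}" && status=0 || status=$?
done

# ExecStopPost equivalent: runs only after the wrapped process has exited.
# In Kubernetes this must finish within terminationGracePeriodSeconds.
cleanup_machines || true
echo "runner exited with status ${status}; machine cleanup done"
```

The signal-forwarding loop is the crux: without it, the wrapper (PID 1 in the container) would swallow the shutdown signal and gitlab-runner would never begin its graceful drain.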

3. **External cleanup.** Clean up orphaned VMs asynchronously via a separate process (e.g., a CronJob or controller that periodically removes stale Docker Machine VMs). Trade-off: potential cost from VMs running longer than necessary.
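A minimal sketch of option 3's CronJob payload follows; every name here (the function, `MAX_AGE_MINUTES`, `RM_CMD`) is hypothetical. It treats any Docker Machine state directory untouched for a configurable age as orphaned; a production version would cross-check the runner managers' live machine lists before deleting anything.

```shell
# Hypothetical stale-VM pruner for a periodic cleanup job (names assumed).
prune_stale_machines() {
  local store=${MACHINE_STORAGE_PATH:-/root/.docker/machine}
  local max_age=${MAX_AGE_MINUTES:-120}
  # Overridable remover so the sketch can be dry-run with RM_CMD=echo.
  # (Deliberately unquoted below so "docker-machine rm -f" word-splits.)
  local rm_cmd=${RM_CMD:-"docker-machine rm -f"}
  # Select machine state directories older than the cutoff and remove the
  # corresponding VMs in parallel; -r skips the command on empty input.
  find "${store}/machines" -mindepth 1 -maxdepth 1 -type d \
       -mmin "+${max_age}" -printf '%f\n' 2>/dev/null \
    | xargs -r -n 1 -P 8 ${rm_cmd}
}
```

Directory mtime is a crude staleness signal, which is why this fits best as a safety net behind options 1 or 2 rather than the primary cleanup path.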

4. **Persistent state with StatefulSet.** Use a Kubernetes StatefulSet with persistent volumes to preserve docker-machine state across pod restarts. New pods inherit the state and continue managing existing VMs. Trade-off: StatefulSets scale down in reverse ordinal order—Kubernetes always removes the highest-ordinal pod first, regardless of job load. This means rollouts must wait for that specific pod to drain, even if other pods are idle, making deploys much slower.

**Recommendation:** Option 1 (build into docker+machine executor) is the preferred approach. It provides the cleanest integration and can optimize cleanup by deleting idle machines early. Option 3 (external cleanup) is a nice-to-have safety net for edge cases where pods crash or get OOM-killed before cleanup completes.

**Idle pool churn on deploys:** Docker-machine state is ephemeral in Kubernetes, so each deploy drains and recreates the idle VM pool. The Chef-based setup preserves state on disk, allowing config changes without cycling VMs. More frequent deploys mean more churn, with added cost and GCP API pressure. This churn is the trade-off for increased deployment velocity.

### Secrets Management

- Runner tokens already in Vault; provision fresh tokens for new runners
@@ -465,3 +495,26 @@ Use [GRIT](https://docs.gitlab.com/runner/grit/) to manage runner infrastructure
- **Limited adoption for Kubernetes.** GRIT is used primarily by internal teams (Dedicated, Demo Architecture) for VM-based deployments, though some external customers use it as well. The Helm chart is the most widely adopted method for deploying runners on Kubernetes.

**Decision:** Not selected. GRIT's primary use case is VM-based runner deployments with scheduled releases. We need Kubernetes-native deployment for continuous delivery. Signed off by VP of Infrastructure Platforms.

### Use Runway

Use [Runway](/handbook/engineering/architecture/design-documents/runway/), GitLab's internal Platform as a Service, to deploy runner managers.

**Pros:**

- Internal platform already built and maintained by GitLab
- Handles deployment, scaling, monitoring automatically
- GitLab CI integration for deployments
- Secrets management via Vault already integrated

**Cons:**

- **Stateless services only.** Runway targets "satellite services that are stateless and thus can be autoscaled." Runner managers maintain state (connections to Docker Machine VMs, in-progress jobs) and require long graceful shutdown periods (3-4 hours).

- **Cloud Run runtime limitations.** Runway uses Cloud Run Services, which have a max request timeout of 60 minutes. Runner managers need 3-4+ hours for graceful shutdown. [Cloud Run Jobs](https://cloud.google.com/run/docs/configuring/task-timeout) support longer timeouts (up to 7 days), but Jobs are designed for batch work that runs to completion—not long-running services that poll continuously.

- **Wrong execution model.** Runway expects services that expose HTTP endpoints, respond to incoming requests, and scale based on request concurrency. Runner managers are the opposite: they poll the GitLab API for jobs and push work outward to Docker Machine VMs. Request-based autoscaling doesn't make sense for this workload.

- **Network complexity.** Runner managers need direct network access to Docker Machine subnets in multiple GCP projects. Runway's Cloud Run runtime operates in its own managed environment.

**Decision:** Not selected. Runway is designed for stateless HTTP services, not long-running infrastructure components with complex networking requirements.