@@ -319,6 +319,36 @@ The current VM-based deployment uses a 2-hour systemd timeout (`TimeoutStopSec=2
**ArgoCD and long-running deploys:** ArgoCD does not block new syncs while pods are terminating — rapid successive deploys may cause resource pressure (multiple pod generations running). Renovate's MR-based workflow provides a natural gate.
### Docker Machine VM Cleanup on Shutdown
The current VM-based deployment includes a post-shutdown cleanup mechanism that is **not part of gitlab-runner itself**. It's implemented as a systemd `ExecStopPost` script that removes stale Docker Machine VMs after the runner manager stops.
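A minimal sketch of the idea behind that script (the storage path and exact `docker-machine` invocation here are assumptions, not the production implementation):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the ExecStopPost cleanup, not the production script.
# Removes every Docker Machine VM left in local state after the runner
# manager has exited, 3 at a time (the concurrency noted below).

# Assumed location of docker-machine's on-disk state.
export MACHINE_STORAGE_PATH="/root/.docker/machine"

docker-machine ls -q |
  xargs --no-run-if-empty -P 3 -n 1 docker-machine rm -f
```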
This script only starts **after** the gitlab-runner process has finished processing all of its assigned jobs, which can take anywhere from a few minutes to a few hours. It runs with a concurrency of 3, which is slow for high-capacity shards (e.g., 600 idle VMs on small-amd64 could take 20-30 minutes). When implementing the entrypoint wrapper, increase concurrency to reduce cleanup time.
**Kubernetes limitation:** Unlike systemd's `ExecStopPost`, Kubernetes has no "post-mortem" hook. The `preStop` hook runs *before* SIGQUIT is sent, not after. Once the container terminates (gracefully or via SIGKILL after grace period), there's no built-in mechanism to run cleanup commands.
**Kubernetes implementation options:**
1. **Build into docker+machine executor.** Implement cleanup logic directly in the docker+machine executor's shutdown sequence ([gitlab-runner!6330](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/6330)). This is the cleanest solution: the executor knows which machines are in-use vs idle, allowing it to delete idle machines early rather than waiting for full job drain. No external wrapper needed.
2. **Entrypoint wrapper.** Wrap gitlab-runner with a script that traps the shutdown signal, forwards it to gitlab-runner, waits for it to exit, then runs cleanup. This mimics systemd's `ExecStopPost` behavior (see the sketch after this list). Constraint: cleanup must complete within `terminationGracePeriodSeconds` (shared with job drain time).
3. **External cleanup.** Clean up orphaned VMs asynchronously via a separate process (e.g., a CronJob or controller that periodically removes stale Docker Machine VMs; a sketch follows the recommendation below). Trade-off: potential cost from VMs running longer than necessary.
4. **Persistent state with StatefulSet.** Use a Kubernetes StatefulSet with persistent volumes to preserve docker-machine state across pod restarts. New pods inherit the state and continue managing existing VMs. Trade-off: StatefulSets scale down in reverse ordinal order—Kubernetes always removes the highest-ordinal pod first, regardless of job load. This means rollouts must wait for that specific pod to drain, even if other pods are idle, making deploys much slower.
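A minimal sketch of the entrypoint wrapper from option 2, assuming gitlab-runner's standard `run` command and its SIGQUIT-for-graceful-shutdown semantics; the cleanup concurrency of 10 is an assumed value, not a tested one:

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint wrapper (option 2), mimicking systemd's
# ExecStopPost. Not the actual implementation.

cleanup() {
  # Remove any Docker Machine VMs that survived the drain, in parallel.
  docker-machine ls -q |
    xargs --no-run-if-empty -P 10 -n 1 docker-machine rm -f
}

# Run the runner manager in the background so this shell receives signals.
gitlab-runner run &
runner_pid=$!

# gitlab-runner treats SIGQUIT as graceful shutdown (finish running jobs,
# accept no new ones), so forward whatever Kubernetes sends as SIGQUIT.
trap 'kill -QUIT "$runner_pid" 2>/dev/null' TERM INT QUIT

# `wait` returns early when a trapped signal arrives, so loop until the
# runner process has actually exited (i.e., finished draining its jobs).
status=0
while kill -0 "$runner_pid" 2>/dev/null; do
  wait "$runner_pid"
  status=$?
done

# Post-drain cleanup: must finish within what remains of
# terminationGracePeriodSeconds.
cleanup
exit "$status"
```

Because the drain and the cleanup share a single grace period, `terminationGracePeriodSeconds` has to cover the worst-case sum of both.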
**Recommendation:** Option 1 (build into docker+machine executor) is the preferred approach. It provides the cleanest integration and can optimize cleanup by deleting idle machines early. Option 3 (external cleanup) is a nice-to-have safety net for edge cases where pods crash or get OOM-killed before cleanup completes.
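A minimal sketch of that safety net, assuming a GCE-backed fleet and a script run from a Kubernetes CronJob; the project name, VM name prefix, and age threshold are illustrative assumptions, and a production version would need to cross-check VM ownership before deleting:

```bash
#!/usr/bin/env bash
# Hypothetical safety-net cleanup (option 3), run from a Kubernetes CronJob.
set -euo pipefail

PROJECT="example-ci-machines"   # assumed GCP project hosting runner VMs
PREFIX="runner-"                # assumed docker-machine instance name prefix
MAX_AGE_HOURS=6                 # assume anything older than this is orphaned

cutoff="$(date -u -d "${MAX_AGE_HOURS} hours ago" +%Y-%m-%dT%H:%M:%SZ)"

# List instances matching the prefix that predate the cutoff, then delete
# them. A real implementation must also verify the VM is not owned by a
# live runner manager, or it could kill VMs still running long jobs.
gcloud compute instances list \
  --project "$PROJECT" \
  --filter="name~'^${PREFIX}' AND creationTimestamp<'${cutoff}'" \
  --format="csv[no-heading](name,zone.basename())" |
while IFS=, read -r name zone; do
  echo "Deleting orphaned VM ${name} (${zone})"
  gcloud compute instances delete "$name" \
    --project "$PROJECT" --zone "$zone" --quiet
done
```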
**Idle pool churn on deploys:** Docker-machine state is ephemeral in Kubernetes, so each deploy drains and recreates the idle VM pool. The Chef-based setup preserves state on disk, allowing config changes without cycling VMs. More frequent deploys mean more churn—added cost and GCP API pressure. This is a trade-off of increased deployment velocity.
### Secrets Management
- Runner tokens already in Vault; provision fresh tokens for new runners
@@ -465,3 +495,26 @@ Use [GRIT](https://docs.gitlab.com/runner/grit/) to manage runner infrastructure
- **Limited adoption for Kubernetes.** GRIT is used primarily by internal teams (Dedicated, Demo Architecture) for VM-based deployments, though some external customers use it as well. The Helm chart is the most widely adopted method for deploying runners on Kubernetes.
**Decision:** Not selected. GRIT's primary use case is VM-based runner deployments with scheduled releases. We need Kubernetes-native deployment for continuous delivery. Signed off by VP of Infrastructure Platforms.
### Use Runway
Use [Runway](/handbook/engineering/architecture/design-documents/runway/), GitLab's internal Platform as a Service, to deploy runner managers.
**Pros:**
- Internal platform already built and maintained by GitLab
**Cons:**
- **Stateless services only.** Runway targets "satellite services that are stateless and thus can be autoscaled." Runner managers maintain state (connections to Docker Machine VMs, in-progress jobs) and require long graceful shutdown periods (3-4 hours).
- **Cloud Run runtime limitations.** Runway uses Cloud Run Services, which have a maximum request timeout of 60 minutes. Runner managers need 3-4+ hours for graceful shutdown. [Cloud Run Jobs](https://cloud.google.com/run/docs/configuring/task-timeout) support longer timeouts (up to 7 days), but Jobs are designed for batch work that runs to completion—not long-running services that poll continuously.
- **Wrong execution model.** Runway expects services that expose HTTP endpoints, respond to incoming requests, and scale based on request concurrency. Runner managers are the opposite: they poll the GitLab API for jobs and push work outward to Docker Machine VMs. Request-based autoscaling doesn't make sense for this workload.
- **Network complexity.** Runner managers need direct network access to Docker Machine subnets in multiple GCP projects. Runway's Cloud Run runtime operates in its own managed environment, which complicates that access.
**Decision:** Not selected. Runway is designed for stateless HTTP services, not long-running infrastructure components with complex networking requirements.