Commit 85caaa16 authored by Igor's avatar Igor

Runner Managers on Kubernetes: Note docker-machine cleanup behaviour, add Runway to considered alternatives
parent 173f8be4
@@ -319,6 +319,36 @@ The current VM-based deployment uses a 2-hour systemd timeout (`TimeoutStopSec=2

**ArgoCD and long-running deploys:** ArgoCD does not block new syncs while pods are terminating — rapid successive deploys may cause resource pressure (multiple pod generations running). Renovate's MR-based workflow provides a natural gate.

### Docker Machine VM Cleanup on Shutdown

The current VM-based deployment includes a post-shutdown cleanup mechanism that is **not part of gitlab-runner itself**. It's implemented as a systemd `ExecStopPost` script that removes stale Docker Machine VMs after the runner manager stops:

```bash
#!/bin/bash
set -eo pipefail
parallel=${1:-1}
export MACHINE_STORAGE_PATH=${MACHINE_STORAGE_PATH:-/root/.docker/machine}
# -r: skip docker-machine entirely when the machines directory is empty,
# so the script doesn't fail under `set -e` on an already-clean host
ls "${MACHINE_STORAGE_PATH}/machines/" | xargs -r -n 1 -P "${parallel}" docker-machine rm -f
```

This script only starts **after** the gitlab-runner process has finished processing all of its assigned jobs, which can take anywhere from a few minutes to a few hours. It runs with a concurrency of 3, which is slow for high-capacity shards (e.g., removing 600 idle VMs on small-amd64 could take 20-30 minutes). When implementing the entrypoint wrapper, increase the concurrency to reduce cleanup time.

**Kubernetes limitation:** Unlike systemd's `ExecStopPost`, Kubernetes has no "post-mortem" hook. The `preStop` hook runs *before* SIGQUIT is sent, not after. Once the container terminates (gracefully or via SIGKILL after grace period), there's no built-in mechanism to run cleanup commands.

**Kubernetes implementation options:**

1. **Build into docker+machine executor.** Implement cleanup logic directly in the docker+machine executor's shutdown sequence ([gitlab-runner!6330](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/6330)). This is the cleanest solution: the executor knows which machines are in-use vs idle, allowing it to delete idle machines early rather than waiting for full job drain. No external wrapper needed.

2. **Entrypoint wrapper.** Wrap gitlab-runner with a script that traps the shutdown signal, forwards it to gitlab-runner, waits for it to exit, then runs cleanup. This mimics systemd's `ExecStopPost` behavior. Constraint: cleanup must complete within `terminationGracePeriodSeconds` (shared with job drain time).
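Option 2 can be sketched as a small shell wrapper. This is illustrative only: the function name, `CLEANUP_PARALLEL` variable, and the `sleep` demo default are assumptions, not part of gitlab-runner; a real wrapper would receive gitlab-runner's arguments and exit with its status.

```shell
#!/usr/bin/env bash
# Sketch of an ExecStopPost-style entrypoint wrapper (all names assumed).
set -uo pipefail

cleanup_machines() {
  # Same removal logic as the systemd ExecStopPost script, with a higher
  # default concurrency to shorten cleanup on large shards.
  local parallel=${CLEANUP_PARALLEL:-16}
  local store=${MACHINE_STORAGE_PATH:-/root/.docker/machine}
  ls "${store}/machines/" 2>/dev/null \
    | xargs -r -n 1 -P "${parallel}" docker-machine rm -f
}

cmd=("$@")
# Demo default so the sketch runs standalone; the real wrapper would start
# gitlab-runner with its usual arguments here.
[ ${#cmd[@]} -eq 0 ] && cmd=(sleep 1)

"${cmd[@]}" &
child=$!

# Mimic systemd: forward shutdown signals to the wrapped process.
trap 'kill -TERM "${child}" 2>/dev/null' TERM
trap 'kill -QUIT "${child}" 2>/dev/null' QUIT

# Wait until the wrapped process has drained and exited; `wait` can return
# early when a trap fires, so loop while the child is still alive.
status=0
while kill -0 "${child}" 2>/dev/null; do
  wait "${child}" && status=0 || status=$?
done

# ExecStopPost equivalent: runs only after the wrapped process has exited.
# In Kubernetes this must finish within terminationGracePeriodSeconds.
cleanup_machines || true
echo "runner exited with status ${status}; machine cleanup done"
```

The signal-forwarding loop is the crux: without it, the wrapper (PID 1 in the container) would swallow the shutdown signal and gitlab-runner would never begin its graceful drain.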

3. **External cleanup.** Clean up orphaned VMs asynchronously via a separate process (e.g., a CronJob or controller that periodically removes stale Docker Machine VMs). Trade-off: potential cost from VMs running longer than necessary.
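A minimal sketch of option 3's CronJob payload follows; every name here (the function, `MAX_AGE_MINUTES`, `RM_CMD`) is hypothetical. It treats any Docker Machine state directory untouched for a configurable age as orphaned; a production version would cross-check the runner managers' live machine lists before deleting anything.

```shell
# Hypothetical stale-VM pruner for a periodic cleanup job (names assumed).
prune_stale_machines() {
  local store=${MACHINE_STORAGE_PATH:-/root/.docker/machine}
  local max_age=${MAX_AGE_MINUTES:-120}
  # Overridable remover so the sketch can be dry-run with RM_CMD=echo.
  # (Deliberately unquoted below so "docker-machine rm -f" word-splits.)
  local rm_cmd=${RM_CMD:-"docker-machine rm -f"}
  # Select machine state directories older than the cutoff and remove the
  # corresponding VMs in parallel; -r skips the command on empty input.
  find "${store}/machines" -mindepth 1 -maxdepth 1 -type d \
       -mmin "+${max_age}" -printf '%f\n' 2>/dev/null \
    | xargs -r -n 1 -P 8 ${rm_cmd}
}
```

Directory mtime is a crude staleness signal, which is why this fits best as a safety net behind options 1 or 2 rather than the primary cleanup path.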

4. **Persistent state with StatefulSet.** Use a Kubernetes StatefulSet with persistent volumes to preserve docker-machine state across pod restarts. New pods inherit the state and continue managing existing VMs. Trade-off: StatefulSets scale down in reverse ordinal order—Kubernetes always removes the highest-ordinal pod first, regardless of job load. This means rollouts must wait for that specific pod to drain, even if other pods are idle, making deploys much slower.

**Recommendation:** Option 1 (build into docker+machine executor) is the preferred approach. It provides the cleanest integration and can optimize cleanup by deleting idle machines early. Option 3 (external cleanup) is a nice-to-have safety net for edge cases where pods crash or get OOM-killed before cleanup completes.

**Idle pool churn on deploys:** Docker-machine state is ephemeral in Kubernetes, so each deploy drains and recreates the idle VM pool. The Chef-based setup preserves state on disk, allowing config changes without cycling VMs. More frequent deploys mean more churn, with added cost and GCP API pressure. This churn is the trade-off for increased deployment velocity.

### Secrets Management

- Runner tokens already in Vault; provision fresh tokens for new runners
@@ -465,3 +495,26 @@ Use [GRIT](https://docs.gitlab.com/runner/grit/) to manage runner infrastructure
- **Limited adoption for Kubernetes.** GRIT is used primarily by internal teams (Dedicated, Demo Architecture) for VM-based deployments, though some external customers use it as well. The Helm chart is the most widely adopted method for deploying runners on Kubernetes.

**Decision:** Not selected. GRIT's primary use case is VM-based runner deployments with scheduled releases. We need Kubernetes-native deployment for continuous delivery. Signed off by VP of Infrastructure Platforms.

### Use Runway

Use [Runway](/handbook/engineering/architecture/design-documents/runway/), GitLab's internal Platform as a Service, to deploy runner managers.

**Pros:**

- Internal platform already built and maintained by GitLab
- Handles deployment, scaling, monitoring automatically
- GitLab CI integration for deployments
- Secrets management via Vault already integrated

**Cons:**

- **Stateless services only.** Runway targets "satellite services that are stateless and thus can be autoscaled." Runner managers maintain state (connections to Docker Machine VMs, in-progress jobs) and require long graceful shutdown periods (3-4 hours).

- **Cloud Run runtime limitations.** Runway uses Cloud Run Services, which have a max request timeout of 60 minutes. Runner managers need 3-4+ hours for graceful shutdown. [Cloud Run Jobs](https://cloud.google.com/run/docs/configuring/task-timeout) support longer timeouts (up to 7 days), but Jobs are designed for batch work that runs to completion—not long-running services that poll continuously.

- **Wrong execution model.** Runway expects services that expose HTTP endpoints, respond to incoming requests, and scale based on request concurrency. Runner managers are the opposite: they poll the GitLab API for jobs and push work outward to Docker Machine VMs. Request-based autoscaling doesn't make sense for this workload.

- **Network complexity.** Runner managers need direct network access to Docker Machine subnets in multiple GCP projects. Runway's Cloud Run runtime operates in its own managed environment.

**Decision:** Not selected. Runway is designed for stateless HTTP services, not long-running infrastructure components with complex networking requirements.