[Spike] Replace vcluster and native review jobs with per-job k3d clusters to eliminate shared host contention
Problem
The review_*, review_specs_*, and qa_* CI jobs fail frequently with:
- helm upgrade --install timeouts (900s limit)
- Rollout / kubectl rollout status timeouts
- Pod scheduling failures (pods stuck Pending)
These failures happen across both vcluster and native GKE/EKS environments.
Current Architecture
Each MR already gets its own vcluster (rvw-{CI_PIPELINE_ID}) running inside a shared GKE host cluster (gkevc-ci-cluster). This provides namespace-level isolation, but all vclusters still share the same GKE node pool. When many MRs run pipelines concurrently:
- GKE autoscaler can't provision new nodes fast enough (2–4 min per node)
- vcluster pods (control plane, workloads) stay Pending during that window
- helm --wait --timeout 900s fires before nodes join
- The job fails, even though there's nothing wrong with the chart itself
The GKE 1.35 and EKS 1.34 "native" environments (no vcluster) suffer the same starvation on their own shared clusters.
Proposed Solution
Give every CI job its own fully-isolated Kubernetes cluster with dedicated CPU/RAM. No shared host; no autoscaling contention.
Tool: k3d — k3s packaged as Docker containers. It:
- Runs entirely inside the GitLab runner's Docker environment (no VM)
- Starts in ~30 seconds
- Supports pinned K8s versions (--image rancher/k3s:v1.35.0-k3s1)
- Includes a built-in LoadBalancer (via Traefik + k3d port mapping)
- Supports ARM64
- Requires only a Docker executor with privileged: true
Proposed Job Topology
Option B (recommended): one k3d cluster per test job
The review_* (deploy-only) job is eliminated. Each downstream test job becomes self-contained:

    review_specs_k3d_v135:
      script:
        - k3d_install && k3d_create        # ~30s
        - deploy_external_services         # Valkey, CNPG, Garage
        - helm upgrade --install --wait    # GitLab chart
        - run_specs                        # feature specs against local cluster
        - k3d_delete

    qa_k3d_v135:
      parallel: 5
      script:
        - k3d_install && k3d_create
        - deploy_external_services
        - helm upgrade --install --wait
        - run_qa_tests                     # each worker tests its own instance
        - k3d_delete

Each parallel QA worker gets its own independent GitLab instance. The qa_report stage is unchanged; it still aggregates JUnit artifacts.
Trade-off: Each job pays a ~10–15 min deploy overhead, so total compute is higher. Wall time is similar because jobs are parallel. This is acceptable given that the current shared approach produces unreliable results regardless of timing.
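In the job sketch above, k3d_delete is the last script step, so it would be skipped whenever an earlier step fails. One possible way around that is an EXIT trap (GitLab's after_script would be another). The following is only a sketch: with_k3d_cluster is a hypothetical wrapper, and k3d_create / k3d_delete are stubbed here in place of the real helpers.

```shell
# Stubs standing in for the helpers named in the job sketch; the real ones
# are assumed to live in scripts/ci/k3d.sh.
k3d_create() { echo "create"; }
k3d_delete() { echo "delete"; }

# Hypothetical wrapper: creates a cluster, runs the given command, and deletes
# the cluster when the subshell exits, whether the command succeeded or not.
# The function body is a subshell so the EXIT trap stays local to it.
with_k3d_cluster() (
  trap k3d_delete EXIT
  k3d_create
  "$@"
)
```

A job step could then read with_k3d_cluster run_specs; the wrapper's exit status is that of the wrapped command, so a failing spec run still fails the job.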
Key Technical Challenges
1. DNS and TLS
Current setup uses ExternalDNS + wildcard cert on *.cloud-native-vcluster.helm-charts.win. With k3d inside a CI container, the cluster is on the Docker bridge — no public DNS.
Proposed approach for PoC: use nip.io with the runner's Docker bridge gateway IP:
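
    # Parse the source IP by its "src" keyword; the field position varies with routing topology.
    KUBE_INGRESS_BASE_DOMAIN=$(ip route get 1 | awk '{for (i = 1; i < NF; i++) if ($i == "src") {print $(i + 1); exit}}').nip.io
    # e.g. 172.17.0.1.nip.io, which resolves back to 172.17.0.1

TLS: disable for the PoC (global.ingress.tls.enabled=false) or use a self-signed cert accepted by the spec runner.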
For production rollout, a dedicated *.ci-local.helm-charts.win zone pointing at a static runner IP range could restore proper TLS without ExternalDNS.
2. Agent-based cluster connectivity
Current jobs use kubectl config use-context ${AGENT_PROJECT_PATH}:${AGENT_NAME}. With a local k3d cluster, we skip the GitLab Agent entirely and use the kubeconfig exported by k3d directly.
Change required in scripts/ci/autodevops.sh set_context(): when K3D_MODE=true, skip the agent context switch.
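A minimal sketch of what that guard could look like. Only the kubectl command is taken from the text above; the rest of the function body, the log message, and the assumption that KUBECONFIG already points at the file written by k3d are illustrative and may not match the real autodevops.sh.

```shell
# Sketch of set_context() with a K3D_MODE guard (assumed shape, not the
# actual function from scripts/ci/autodevops.sh).
set_context() {
  if [ "${K3D_MODE:-false}" = "true" ]; then
    # Local k3d cluster: rely on the kubeconfig exported by k3d.
    echo "K3D_MODE=true: using kubeconfig ${KUBECONFIG:-unset}, skipping agent context"
    return 0
  fi
  # Default path: switch to the GitLab Agent context.
  kubectl config use-context "${AGENT_PROJECT_PATH}:${AGENT_NAME}"
}
```

Before calling it, a k3d job would typically export the kubeconfig path, for example with export KUBECONFIG="$(k3d kubeconfig write <cluster-name>)".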
3. K8s version matrix
k3d supports the same version matrix via --image rancher/k3s:<version>. All five current environments (v1.33, v1.34, v1.35, ARM64, Flux) can be preserved by parameterising K3D_K8S_VERSION.
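As a rough illustration of that parameterisation, the helper below prints the k3d command a job would run for a given version. k3d_create_cmd, K3D_CLUSTER_NAME, the default cluster name, and the 120s timeout are assumptions made for this sketch; it also assumes K3D_K8S_VERSION holds a full k3s image tag such as v1.35.0-k3s1.

```shell
# Hypothetical helper: prints (does not run) the cluster-create command.
# K3D_CLUSTER_NAME is an assumed variable; K3D_K8S_VERSION is assumed to
# carry a full rancher/k3s image tag.
k3d_create_cmd() {
  name="${K3D_CLUSTER_NAME:-review-${CI_JOB_ID:-local}}"
  version="${K3D_K8S_VERSION:-v1.35.0-k3s1}"
  printf 'k3d cluster create %s --image rancher/k3s:%s --wait --timeout 120s\n' \
    "$name" "$version"
}
```

Each environment config would then differ only in the value it sets for K3D_K8S_VERSION (plus the runner architecture for ARM64).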
4. External services
Valkey, CloudNativePG, and Garage are deployed inside the cluster via Helm. They work identically inside k3d. No changes to scripts/ci/lib/{valkey,cloudnativepg,garage}.sh.
Phased Approach
Phase 1 — PoC (single environment, merged job)
Validate that k3d can deploy the full GitLab chart reliably:
- New scripts/ci/k3d.sh (lifecycle: install, create, kubeconfig, delete)
- New scripts/ci/k3d_deploy.sh (mirrors vcluster_deploy.sh, writes nip.io URL to VARIABLES_FILE)
- New .gitlab/ci/k3d-review-apps.gitlab-ci.yml (base template, merged deploy+specs)
- New .gitlab/ci/environments/k3d.135.amd64.gitlab-ci.yml (gated behind LIMIT_TO=k3d135)
- Minor update to scripts/ci/autodevops.sh set_context()
All existing vcluster jobs remain untouched during the PoC.
Phase 2 — Per-job topology
If the PoC passes: restructure all environment configs to use one k3d cluster per test job. Eliminate the review_* (deploy-only) stage.
Phase 3 — Full rollout
Replace all vcluster environment configs with k3d equivalents. Remove vcluster infrastructure dependencies.
Files to Change (PoC only)
| File | Change |
|---|---|
| scripts/ci/k3d.sh | New: cluster lifecycle |
| scripts/ci/k3d_deploy.sh | New: deploy wrapper with nip.io URL |
| .gitlab/ci/k3d-review-apps.gitlab-ci.yml | New: base k3d CI template |
| .gitlab/ci/environments/k3d.135.amd64.gitlab-ci.yml | New: PoC environment config |
| scripts/ci/autodevops.sh | set_context(): skip agent when K3D_MODE=true |
| .gitlab-ci.yml | Include new k3d template, add K3D_MODE variable |
Acceptance Criteria (PoC)
- LIMIT_TO=k3d135 pipeline runs review_k3d_v135 and completes without timeout
- k3d cluster creates cleanly with external services (Valkey, CNPG, Garage)
- helm upgrade --install completes within 900s
- Feature specs reach GitLab via nip.io URL and pass
- Running 3–5 concurrent pipelines shows no resource contention (5 parallel QA workers each get their own isolated cluster)
- p50 job duration is comparable to (or better than) the existing review_v135 vcluster job
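For the last criterion, the p50 could be computed from a plain list of job durations, one value in seconds per line. The sketch below assumes that input format and uses the lower median for even-sized samples; how the durations are exported from GitLab is left open.

```shell
# Median (p50) of numeric values read from stdin, one per line.
# For an even number of samples this returns the lower of the two middle values.
p50() {
  sort -n | awk '{ v[NR] = $1 } END { if (NR) print v[int((NR + 1) / 2)] }'
}
```

Running it once over the k3d job durations and once over the review_v135 durations gives the two figures to compare.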
Related
- Current vcluster scripts: scripts/ci/vcluster.sh, scripts/ci/vcluster_deploy.sh
- Current environment configs: .gitlab/ci/environments/vcluster.*.gitlab-ci.yml
- Current base template: .gitlab/ci/vcluster-review-apps.gitlab-ci.yml
Implementation Status
Phase 1 — PoC (!4967 (merged)) ✅ In review
All acceptance criteria met. See MR !4967 (merged) for details.
Findings during PoC:
- GitLab 17+ PAT tokens use a routing suffix with dots (e.g. glpat-xxx.01.yyy); the grep pattern extracting the token from gitlab-rails runner output must include . in the character class.
- SSH push-over-SSH specs require --port "22:22@loadbalancer" on k3d cluster create; the NGINX ingress TCP config for port 22 → gitlab-shell is already in the chart defaults.
- ip route get field position for the source IP varies with routing topology; parse by src keyword, not position.
- The e2e runner tag (GitLab internal DinD fleet) is required. saas-linux-large-amd64 runners do not have DinD pre-configured. The e2e fleet runs privileged DinD with sufficient resources for the full GitLab workload.
- The CI/CD GITLAB_QA_ADMIN_ACCESS_TOKEN is scoped to the shared vcluster instance and returns 401 against a fresh k3d deployment. The QA job mints a fresh admin PAT via gitlab-rails runner, exports it as GITLAB_ADMIN_TOKEN in VARIABLES_FILE, then re-exports it as GITLAB_QA_ADMIN_ACCESS_TOKEN so gitlab-qa picks it up. Basic HTTP auth does not work reliably against freshly-deployed instances.
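The first finding can be illustrated with a small extraction helper. The exact pattern used in the MR is not shown in this issue, so the regular expression and the extract_pat name below are assumptions; the point is only that the character class contains a dot.

```shell
# Pull the first PAT-looking token out of text read on stdin. The character
# class includes "." so routable tokens such as glpat-xxx.01.yyy stay intact.
extract_pat() {
  grep -oE 'glpat-[A-Za-z0-9_.-]+' | head -n 1
}
```

With a class lacking the dot, the same input would be cut at the first routing separator and the truncated token would be rejected. Note that this pattern would also swallow a trailing full stop if the runner output ended a sentence with the token.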
Planned Follow-up MRs
| # | MR | Branch | Status | Description |
|---|---|---|---|---|
| 1 | !4967 (merged) | feature/k3d-per-job-review-env | In review (pipeline) | Phase 1 PoC: k3d v1.35 amd64, NGINX ingress |
| 2 | !4984 (merged) | feature/k3d-envoy-gateway | Draft | Replace NGINX ingress with Envoy Gateway (Gateway API) |
| 3 | !4982 (merged) | feature/k3d-k8s-matrix | Draft | Extend matrix: k3d v1.33, v1.34, v1.35 ARM64 |
| 4 | !4983 (merged) | feature/k3d-manual-full-suite | Draft | Add qa_k3d_*_manual_full_suite jobs (parallel: 7, when: manual) |
| 5 | !4985 (merged) | feature/k3d-remove-native | Draft (WIP) | Remove vcluster/EKS/GKE trigger_review_* jobs; keep one GKE nightly (stable channel, no vcluster) |
MRs are sequenced — each is blocked by the previous (GitLab MR dependencies). Merge order: !4967 (merged) → !4984 (merged) → !4982 (merged) → !4983 (merged) → !4985 (merged).
MRs 2–5 temporarily target feature/k3d-per-job-review-env and will be retargeted to master before each merge.
Notes on "Remove Native Tests" (MR 5)
The current trigger_review_current and trigger_review_secondary jobs trigger child pipelines for 7 environments (v133, v134, v135, v135a/arm64, flux, gke135, eks134). Once the k3d matrix fully covers the K8s version matrix, these become redundant.
Keep: one scheduled nightly pipeline deploying to a real GKE cluster using the GCP stable release channel. This validates against a real cloud provider without needing vcluster — GCP manages node upgrades automatically. No vcluster infrastructure required.
Remove: all vcluster environment configs, the trigger_review_current/trigger_review_secondary manual gates, and the EKS environment.
Phase 4 — Cost/Benefit Evaluation
Before fully retiring the vcluster/GKE infrastructure, we need to evaluate whether the k3d approach is cost-neutral or better.
Key question: the current GKE cluster (gkevc-ci-cluster) is a static, always-on cost. The k3d approach trades that for per-minute ephemeral runner cost on large (e2e-tagged) instances. The crossover point depends on how many concurrent pipelines are running at any given time.
What to measure:
- Average and peak number of concurrent review pipelines per day
- Cost per e2e runner-minute (large DinD fleet) vs. monthly GKE cluster cost
- Whether the GKE cluster can be fully decommissioned (no other workloads depending on it) or only right-sized
- Runner compute time added per MR pipeline by the self-provisioning overhead (~10–15 min deploy per job × number of parallel jobs)
Expected outcome: the GKE cluster runs 24/7 regardless of pipeline activity, so at low-to-moderate pipeline concurrency the ephemeral model should be cheaper or equivalent. At very high concurrency it may be more expensive, but the reliability improvement justifies the cost in either case.
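The crossover point can be estimated with simple arithmetic once the measurements are in. The sketch below uses a deliberately simplified model (fixed monthly cluster cost versus added runner-minutes per pipeline) and placeholder figures in cents; none of the numbers are real billing data, and break_even_pipelines is a hypothetical helper.

```shell
# Break-even number of pipelines per month: the point where the added runner
# cost equals the fixed monthly cluster cost. Integer arithmetic, in cents.
break_even_pipelines() {
  gke_monthly_cents=$1       # fixed monthly cost of the shared cluster
  runner_cents_per_min=$2    # cost of one e2e runner-minute
  added_min_per_pipeline=$3  # deploy overhead x number of parallel jobs
  echo $(( gke_monthly_cents / (runner_cents_per_min * added_min_per_pipeline) ))
}

# Placeholder inputs only: 200000 cents/month, 2 cents/min,
# 15 min overhead x 6 jobs = 90 added minutes per pipeline.
break_even_pipelines 200000 2 90   # prints 1111
```

Below the printed pipeline count the ephemeral model would be cheaper under these assumptions; above it, the static cluster would be.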
Action: gather 30-day pipeline metrics from GitLab CI analytics and GCP billing, then produce a before/after estimate. This can be done in parallel with the MR merge sequence and does not block !4985 (merged).