[Spike] Replace vcluster and native review jobs with per-job k3d clusters to eliminate shared host contention

Problem

The review_*, review_specs_*, and qa_* CI jobs fail frequently with:

  • helm upgrade --install timeouts (900s limit)
  • Rollout / kubectl rollout status timeouts
  • Pod scheduling failures (pods stuck Pending)

These failures happen across both vcluster and native GKE/EKS environments.

Current Architecture

Each MR already gets its own vcluster (rvw-{CI_PIPELINE_ID}) running inside a shared GKE host cluster (gkevc-ci-cluster). This provides namespace-level isolation, but all vclusters still share the same GKE node pool. When many MRs run pipelines concurrently:

  1. GKE autoscaler can't provision new nodes fast enough (2–4 min per node)
  2. vcluster pods (control plane, workloads) stay Pending during that window
  3. The helm --wait --timeout 900s deadline expires before the new nodes join
  4. The job fails — even though there's nothing wrong with the chart itself

The GKE 1.35 and EKS 1.34 "native" environments (no vcluster) suffer the same starvation on their own shared clusters.

Proposed Solution

Give every CI job its own fully-isolated Kubernetes cluster with dedicated CPU/RAM. No shared host; no autoscaling contention.

Tool: k3d — k3s packaged as Docker containers. It:

  • Runs entirely inside the GitLab runner's Docker environment (no VM)
  • Starts in ~30 seconds
  • Supports pinned K8s versions (--image rancher/k3s:v1.35.0-k3s1)
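  • Includes a built-in LoadBalancer (k3s ServiceLB, plus k3d's load balancer container for host port mapping); k3s also bundles Traefik as its default ingress controller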
  • Supports ARM64
  • Requires only a Docker executor with privileged: true
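
As a rough sketch of what creating a pinned-version cluster could look like, the helper below only prints the k3d command instead of running it. The function name, the ci-<job id> naming scheme, and the 120s timeout are assumptions for illustration and are not taken from the existing scripts.

```shell
# Hypothetical helper: print the k3d command for a per-job cluster.
# Usage: k3d_create_cmd <job-id> <k3s-image-tag>
k3d_create_cmd() {
  job_id="$1"
  k3s_tag="$2"
  # --image pins the Kubernetes version; --wait blocks until the server is ready.
  printf 'k3d cluster create ci-%s --image rancher/k3s:%s --wait --timeout 120s\n' \
    "$job_id" "$k3s_tag"
}

# CI_JOB_ID is a predefined GitLab CI variable; the fallbacks are for local runs.
k3d_create_cmd "${CI_JOB_ID:-local}" "${K3D_K8S_VERSION:-v1.35.0-k3s1}"
```

Naming the cluster after the job ID keeps concurrent jobs on the same runner host from colliding, which is the main reason for the assumed scheme.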

Proposed Job Topology

The review_* (deploy-only) job is eliminated. Each downstream test job becomes self-contained:

review_specs_k3d_v135:
  script:
    - k3d_install && k3d_create          # ~30s
    - deploy_external_services           # Valkey, CNPG, Garage
    - helm upgrade --install --wait      # GitLab chart
    - run_specs                          # feature specs against local cluster
    - k3d_delete

qa_k3d_v135:
  parallel: 5
  script:
    - k3d_install && k3d_create
    - deploy_external_services
    - helm upgrade --install --wait
    - run_qa_tests                       # each worker tests its own instance
    - k3d_delete

Each parallel QA worker gets its own independent GitLab instance. The qa_report stage is unchanged — it still aggregates JUnit artifacts.
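
One detail worth noting in the topology above: script steps run in order, so a failing deploy or spec step would skip the final k3d_delete. A minimal sketch of a wrapper that always runs cleanup is shown here; with_cleanup and the demo_* functions are hypothetical names, and GitLab CI's after_script keyword is an alternative place for the delete step.

```shell
# Hypothetical wrapper: run a command, then always run the cleanup command,
# and return the exit status of the main command.
# Usage: with_cleanup <cleanup-command> <command> [args...]
with_cleanup() {
  cleanup_cmd="$1"
  shift
  status=0
  "$@" || status=$?
  # Cleanup failures are ignored so they do not mask the test result.
  "$cleanup_cmd" || true
  return "$status"
}

# Stand-ins for the real job steps, used here only to demonstrate the flow.
demo_specs() { echo "running specs"; return 1; }
demo_delete() { echo "deleting cluster"; }

with_cleanup demo_delete demo_specs || echo "specs failed with status $?"
```

In a real job the call would look like with_cleanup k3d_delete run_specs, assuming both are shell functions available in the job environment.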

Trade-off: Each job pays a ~10–15 min deploy overhead, so total compute is higher. Wall time is similar because jobs are parallel. This is acceptable given that the current shared approach produces unreliable results regardless of timing.

Key Technical Challenges

1. DNS and TLS

Current setup uses ExternalDNS + wildcard cert on *.cloud-native-vcluster.helm-charts.win. With k3d inside a CI container, the cluster is on the Docker bridge — no public DNS.

Proposed approach for PoC: use nip.io with the runner's Docker bridge gateway IP:

KUBE_INGRESS_BASE_DOMAIN=$(ip route get 1 | awk '{for (i = 1; i < NF; i++) if ($i == "src") {print $(i + 1); exit}}').nip.io
# e.g. 172.17.0.1.nip.io — resolves back to 172.17.0.1

TLS: disable for the PoC (global.ingress.tls.enabled=false) or use a self-signed cert accepted by the spec runner.
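
To make the PoC settings concrete, the sketch below prints the Helm flags for a plain-HTTP deployment on a nip.io domain. The helper name is hypothetical, and mapping the base domain onto global.hosts.domain and global.hosts.https is an assumption about how the deploy script passes it to the chart; only global.ingress.tls.enabled=false comes from the text above.

```shell
# Hypothetical helper: print Helm --set flags for a plain-HTTP nip.io deployment.
# Usage: poc_ingress_flags <ipv4-address>
poc_ingress_flags() {
  ip="$1"
  # nip.io resolves any name ending in <ip>.nip.io back to <ip>.
  printf '%s\n' "--set global.hosts.domain=${ip}.nip.io --set global.hosts.https=false --set global.ingress.tls.enabled=false"
}

poc_ingress_flags 172.17.0.1
```

With these values a hostname such as gitlab.172.17.0.1.nip.io would resolve to 172.17.0.1 without any DNS records being created.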

For production rollout, a dedicated *.ci-local.helm-charts.win zone pointing at a static runner IP range could restore proper TLS without ExternalDNS.

2. Agent-based cluster connectivity

Current jobs use kubectl config use-context ${AGENT_PROJECT_PATH}:${AGENT_NAME}. With a local k3d cluster, we skip the GitLab Agent entirely and use the kubeconfig exported by k3d directly.

Change required in scripts/ci/autodevops.sh set_context(): when K3D_MODE=true, skip the agent context switch.
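
A minimal sketch of that guard follows. The real body of set_context() is not shown in this issue, so everything apart from the K3D_MODE check and the use-context command quoted above is an assumption.

```shell
# Sketch of set_context() with a K3D_MODE guard (actual function body unknown).
set_context() {
  if [ "${K3D_MODE:-false}" = "true" ]; then
    # The kubeconfig exported by k3d is already active, so there is nothing to switch.
    echo "K3D_MODE=true: skipping agent context switch"
    return 0
  fi
  kubectl config use-context "${AGENT_PROJECT_PATH}:${AGENT_NAME}"
}

# Demonstrate the k3d path in a subshell so the variable does not leak.
( K3D_MODE=true; set_context )
```

Defaulting K3D_MODE to false keeps the existing vcluster jobs on the agent path without any change to their configuration.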

3. K8s version matrix

k3d supports the same version matrix via --image rancher/k3s:<version>. All five current environments (v1.33, v1.34, v1.35, ARM64, Flux) can be preserved by parameterising K3D_K8S_VERSION.
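
One way the parameter might be turned into an image reference is sketched below. The helper name is hypothetical and the patch versions in the loop are placeholders, not the versions the pipeline pins. The "+" to "-" substitution reflects that k3s release names contain "+", which is not a valid character in an image tag.

```shell
# Hypothetical helper: turn a k3s release name into an image reference.
# k3s releases are named like v1.35.0+k3s1, but image tags cannot contain "+",
# so the published tags use "-" instead.
k3s_image() {
  printf 'rancher/k3s:%s\n' "$(printf '%s' "$1" | tr '+' '-')"
}

# Placeholder versions; either spelling of the suffix is accepted.
for version in v1.33.0+k3s1 v1.34.0+k3s1 v1.35.0-k3s1; do
  k3s_image "$version"
done
```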

4. External services

Valkey, CloudNativePG, and Garage are deployed inside the cluster via Helm. They work identically inside k3d. No changes to scripts/ci/lib/{valkey,cloudnativepg,garage}.sh.

Phased Approach

Phase 1 — PoC (single environment, merged job)

Validate that k3d can deploy the full GitLab chart reliably:

  • New scripts/ci/k3d.sh (lifecycle: install, create, kubeconfig, delete)
  • New scripts/ci/k3d_deploy.sh (mirrors vcluster_deploy.sh, writes nip.io URL to VARIABLES_FILE)
  • New .gitlab/ci/k3d-review-apps.gitlab-ci.yml (base template, merged deploy+specs)
  • New .gitlab/ci/environments/k3d.135.amd64.gitlab-ci.yml (gated behind LIMIT_TO=k3d135)
  • Minor update to scripts/ci/autodevops.sh set_context()

All existing vcluster jobs remain untouched during the PoC.

Phase 2 — Per-job topology

If the PoC passes: restructure all environment configs to use one k3d cluster per test job. Eliminate the review_* (deploy-only) stage.

Phase 3 — Full rollout

Replace all vcluster environment configs with k3d equivalents. Remove vcluster infrastructure dependencies.

Files to Change (PoC only)

File | Change
scripts/ci/k3d.sh | New — cluster lifecycle
scripts/ci/k3d_deploy.sh | New — deploy wrapper with nip.io URL
.gitlab/ci/k3d-review-apps.gitlab-ci.yml | New — base k3d CI template
.gitlab/ci/environments/k3d.135.amd64.gitlab-ci.yml | New — PoC environment config
scripts/ci/autodevops.sh | set_context() — skip agent when K3D_MODE=true
.gitlab-ci.yml | Include new k3d template, add K3D_MODE variable

Acceptance Criteria (PoC)

  • LIMIT_TO=k3d135 pipeline runs review_k3d_v135 and completes without timeout
  • k3d cluster creates cleanly with external services (Valkey, CNPG, Garage)
  • helm upgrade --install completes within 900s
  • Feature specs reach GitLab via nip.io URL and pass
  • Running 3–5 concurrent pipelines shows no resource contention (5 parallel QA workers each get their own isolated cluster)
  • p50 job duration is comparable to (or better than) the existing review_v135 vcluster job

References

  • Current vcluster scripts: scripts/ci/vcluster.sh, scripts/ci/vcluster_deploy.sh
  • Current environment configs: .gitlab/ci/environments/vcluster.*.gitlab-ci.yml
  • Current base template: .gitlab/ci/vcluster-review-apps.gitlab-ci.yml

Implementation Status

Phase 1 — PoC (!4967 (merged)) In review

All acceptance criteria met. See MR !4967 (merged) for details.

Findings during PoC:

  • GitLab 17+ personal access tokens use a routing suffix with dots (e.g. glpat-xxx.01.yyy), so the grep pattern that extracts the token from gitlab-rails runner output must include . in its character class.
  • Push-over-SSH specs require --port "22:22@loadbalancer" on k3d cluster create; the NGINX ingress TCP config for port 22 → gitlab-shell is already in the chart defaults.
  • ip route get field position for the source IP varies with routing topology; parse by src keyword, not position.
  • The e2e runner tag (GitLab internal DinD fleet) is required. saas-linux-large-amd64 runners do not have DinD pre-configured. The e2e fleet runs privileged DinD with sufficient resources for the full GitLab workload.
  • The CI/CD GITLAB_QA_ADMIN_ACCESS_TOKEN is scoped to the shared vcluster instance and returns 401 against a fresh k3d deployment. The QA job mints a fresh admin PAT via gitlab-rails runner, exports it as GITLAB_ADMIN_TOKEN in VARIABLES_FILE, then re-exports as GITLAB_QA_ADMIN_ACCESS_TOKEN so gitlab-qa picks it up. Basic HTTP auth does not work reliably against freshly-deployed instances.
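
The two parsing findings above (dotted token suffixes and the position of the source IP) can be illustrated with a short sketch. Both helper names are hypothetical, the regular expression is an assumption rather than the pattern used in the MR, and the sample inputs are made up.

```shell
# Hypothetical helper: pull a personal access token out of command output.
# The "." in the bracket expression keeps dotted routing suffixes intact.
extract_pat() {
  grep -oE 'glpat-[A-Za-z0-9_.-]+' | head -n 1
}

# Hypothetical helper: read `ip route get` output and print the address
# that follows the "src" keyword, wherever it appears on the line.
route_src_ip() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Made-up sample inputs for illustration.
echo 'token=glpat-xxx.01.yyy' | extract_pat
echo '1.0.0.0 via 172.17.0.1 dev eth0 src 172.17.0.2 uid 0' | route_src_ip
echo '1.0.0.0 dev eth0 src 10.0.0.5 uid 0' | route_src_ip
```

In the two route samples the source address sits in field 7 and field 5 respectively, which is why a fixed field index is fragile.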

Planned Follow-up MRs

# | MR | Branch | Status | Description
1 | !4967 (merged) | feature/k3d-per-job-review-env | In review — pipeline | Phase 1 PoC: k3d v1.35 amd64, NGINX ingress
2 | !4984 (merged) | feature/k3d-envoy-gateway | Draft | Replace NGINX ingress with Envoy Gateway (Gateway API)
3 | !4982 (merged) | feature/k3d-k8s-matrix | Draft | Extend matrix: k3d v1.33, v1.34, v1.35 ARM64
4 | !4983 (merged) | feature/k3d-manual-full-suite | Draft | Add qa_k3d_*_manual_full_suite jobs (parallel: 7, when: manual)
5 | !4985 (merged) | feature/k3d-remove-native | Draft (WIP) | Remove vcluster/EKS/GKE trigger_review_* jobs; keep one GKE nightly (stable channel, no vcluster)

MRs are sequenced: each is blocked by the previous one (GitLab MR dependencies). Merge order: !4967 (merged) → !4984 (merged) → !4982 (merged) → !4983 (merged) → !4985 (merged).

MRs 2–5 temporarily target feature/k3d-per-job-review-env and will be retargeted to master before each merge.

Notes on "Remove Native Tests" (MR 5)

The current trigger_review_current and trigger_review_secondary jobs trigger child pipelines for 7 environments (v133, v134, v135, v135a/arm64, flux, gke135, eks134). Once the k3d matrix fully covers the K8s version matrix, these become redundant.

Keep: one scheduled nightly pipeline deploying to a real GKE cluster on the GCP stable release channel. This validates against a real cloud provider with no vcluster infrastructure, and GCP manages node upgrades automatically.

Remove: all vcluster environment configs, the trigger_review_current/trigger_review_secondary manual gates, and the EKS environment.

Phase 4 — Cost/Benefit Evaluation

Before fully retiring the vcluster/GKE infrastructure, we need to evaluate whether the k3d approach is cost-neutral or better.

Key question: at what pipeline concurrency does the k3d approach become cheaper or more expensive than the current setup? The current GKE cluster (gkevc-ci-cluster) is a static, always-on cost, whereas k3d incurs per-minute ephemeral runner cost on large (e2e-tagged) instances. The crossover point depends on how many pipelines run concurrently.

What to measure:

  • Average and peak number of concurrent review pipelines per day
  • Cost per e2e runner-minute (large DinD fleet) vs. monthly GKE cluster cost
  • Whether the GKE cluster can be fully decommissioned (no other workloads depending on it) or only right-sized
  • Runner compute time added per MR pipeline by the self-provisioning overhead (~10–15 min deploy per job × number of parallel jobs)

Expected outcome: the GKE cluster runs 24/7 regardless of pipeline activity, so at low-to-moderate pipeline concurrency the ephemeral model should be cheaper or equivalent. At very high concurrency it may be more expensive, but the reliability improvement justifies the cost in either case.
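
The crossover can be estimated with simple arithmetic once the measurements are in: divide the monthly cluster cost by the runner cost of one pipeline. The sketch below does that calculation; the helper name is hypothetical and every figure passed to it is a placeholder, not billing data.

```shell
# Hypothetical helper: pipelines per month at which ephemeral runner cost
# equals the always-on cluster cost.
# Usage: break_even_pipelines <monthly-cluster-cost> <cost-per-runner-minute> \
#                             <jobs-per-pipeline> <minutes-per-job>
break_even_pipelines() {
  awk -v cluster="$1" -v per_min="$2" -v jobs="$3" -v minutes="$4" \
    'BEGIN { printf "%.0f\n", cluster / (per_min * jobs * minutes) }'
}

# Placeholder figures: 3000 per month, 0.05 per runner-minute,
# 6 jobs per pipeline, 40 minutes per job.
break_even_pipelines 3000 0.05 6 40
```

Below the printed pipeline count the ephemeral model would be cheaper under these assumptions, and above it the always-on cluster would be. This ignores any residual cost if the cluster can only be right-sized rather than decommissioned.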

Action: gather 30-day pipeline metrics from GitLab CI analytics and GCP billing, then produce a before/after estimate. This can be done in parallel with the MR merge sequence and does not block !4985 (merged).

Edited by João Alexandre Cunha