[Spike] Replace vcluster and native review jobs with per-job k3d clusters to eliminate shared host contention (#6421)

## Problem

The `review_*`, `review_specs_*`, and `qa_*` CI jobs fail frequently with:

- `helm upgrade --install` timeouts (900s limit)
- Rollout / `kubectl rollout status` timeouts
- Pod scheduling failures (pods stuck `Pending`)

These failures happen across **both** vcluster and native GKE/EKS environments.

## Current Architecture

Each MR already gets its own vcluster (`rvw-{CI_PIPELINE_ID}`) running inside a shared GKE host cluster (`gkevc-ci-cluster`). This provides namespace-level isolation, but **all vclusters still share the same GKE node pool**. When many MRs run pipelines concurrently:

1. The GKE autoscaler can't provision new nodes fast enough (2–4 min per node)
2. vcluster pods (control plane, workloads) stay `Pending` during that window
3. `helm --wait --timeout 900s` fires before nodes join
4. The job fails, even though there's nothing wrong with the chart itself

The GKE 1.35 and EKS 1.34 "native" environments (no vcluster) suffer the same starvation on their own shared clusters.

## Proposed Solution

**Give every CI job its own fully-isolated Kubernetes cluster with dedicated CPU/RAM.** No shared host; no autoscaling contention.

Tool: **[k3d](https://k3d.io/)**, which runs k3s in Docker containers. It:

- Runs entirely inside the GitLab runner's Docker environment (no VM)
- Starts in ~30 seconds
- Supports pinned K8s versions (`--image rancher/k3s:v1.35.0-k3s1`)
- Includes a built-in LoadBalancer (via Traefik + k3d port mapping)
- Supports ARM64
- Requires only a Docker executor with `privileged: true`

## Proposed Job Topology

### Option B (recommended): one k3d cluster per test job

The `review_*` (deploy-only) job is **eliminated**. Each downstream test job becomes self-contained:

```yaml
review_specs_k3d_v135:
  script:
    - k3d_install && k3d_create      # ~30s
    - deploy_external_services       # Valkey, CNPG, Garage
    - helm upgrade --install --wait  # GitLab chart
    - run_specs                      # feature specs against local cluster
    - k3d_delete

qa_k3d_v135:
  parallel: 5
  script:
    - k3d_install && k3d_create
    - deploy_external_services
    - helm upgrade --install --wait
    - run_qa_tests                   # each worker tests its own instance
    - k3d_delete
```

Each parallel QA worker gets its own independent GitLab instance. The `qa_report` stage is unchanged: it still aggregates JUnit artifacts.

**Trade-off:** Each job pays a ~10–15 min deploy overhead, so total compute is higher. Wall time is similar because the jobs run in parallel. This is acceptable given that the current shared approach produces unreliable results regardless of timing.

## Key Technical Challenges

### 1. DNS and TLS

The current setup uses ExternalDNS plus a wildcard cert on `*.cloud-native-vcluster.helm-charts.win`. With k3d inside a CI container, the cluster sits on the Docker bridge and has no public DNS.

**Proposed approach for PoC:** use [nip.io](https://nip.io/) with the runner's Docker bridge gateway IP:

```bash
# Parse by the `src` keyword: its field position varies with routing topology
KUBE_INGRESS_BASE_DOMAIN=$(ip route get 1 | awk '{for (i = 1; i < NF; i++) if ($i == "src") {print $(i + 1); exit}}').nip.io
# e.g. 172.17.0.1.nip.io, which resolves back to 172.17.0.1
```

TLS: disable for the PoC (`global.ingress.tls.enabled=false`) or use a self-signed cert accepted by the spec runner. For production rollout, a dedicated `*.ci-local.helm-charts.win` zone pointing at a static runner IP range could restore proper TLS without ExternalDNS.

### 2. Agent-based cluster connectivity

Current jobs use `kubectl config use-context ${AGENT_PROJECT_PATH}:${AGENT_NAME}`. With a local k3d cluster, we skip the GitLab Agent entirely and use the kubeconfig exported by k3d directly.

Change required in `scripts/ci/autodevops.sh` `set_context()`: when `K3D_MODE=true`, skip the agent context switch.

### 3. K8s version matrix

k3d supports the same version matrix via `--image rancher/k3s:<version>`. All five current environments (v1.33, v1.34, v1.35, ARM64, Flux) can be preserved by parameterising `K3D_K8S_VERSION`.

### 4. External services

Valkey, CloudNativePG, and Garage are deployed inside the cluster via Helm. They work identically inside k3d, so `scripts/ci/lib/{valkey,cloudnativepg,garage}.sh` need no changes.

## Phased Approach

### Phase 1: PoC (single environment, merged job)

Validate that k3d can deploy the full GitLab chart reliably:

- New `scripts/ci/k3d.sh` (lifecycle: install, create, kubeconfig, delete)
- New `scripts/ci/k3d_deploy.sh` (mirrors `vcluster_deploy.sh`, writes nip.io URL to `VARIABLES_FILE`)
- New `.gitlab/ci/k3d-review-apps.gitlab-ci.yml` (base template, merged deploy+specs)
- New `.gitlab/ci/environments/k3d.135.amd64.gitlab-ci.yml` (gated behind `LIMIT_TO=k3d135`)
- Minor update to `scripts/ci/autodevops.sh` `set_context()`

All existing vcluster jobs remain untouched during the PoC.

### Phase 2: Per-job topology

If the PoC passes: restructure all environment configs to use one k3d cluster per test job. Eliminate the `review_*` (deploy-only) stage.

### Phase 3: Full rollout

Replace all vcluster environment configs with k3d equivalents. Remove vcluster infrastructure dependencies.
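
For the Phase 1 items listed above (the `scripts/ci/k3d.sh` lifecycle and the `set_context()` change), a minimal sketch might look like the following. It is an illustration only, not the contents of the actual scripts: the function bodies, the `k3d_kubeconfig` helper name, the `K3D_CLUSTER_NAME` variable, and the default image tag are assumptions, while `K3D_MODE`, `K3D_K8S_VERSION`, and the agent context string are taken from the text above.

```bash
#!/usr/bin/env bash
# Sketch only: helper bodies are assumptions, not the real scripts/ci/k3d.sh.

# Hypothetical defaults; K3D_CLUSTER_NAME is not defined anywhere in this issue.
K3D_CLUSTER_NAME="${K3D_CLUSTER_NAME:-review-${CI_JOB_ID:-local}}"
K3D_K8S_VERSION="${K3D_K8S_VERSION:-v1.35.0-k3s1}"

# Create a cluster pinned to one K8s version and wait until it is ready.
k3d_create() {
  k3d cluster create "${K3D_CLUSTER_NAME}" \
    --image "rancher/k3s:${K3D_K8S_VERSION}" \
    --wait
}

# Point kubectl at the new cluster; `k3d kubeconfig write` prints the file path.
k3d_kubeconfig() {
  KUBECONFIG="$(k3d kubeconfig write "${K3D_CLUSTER_NAME}")"
  export KUBECONFIG
}

# Remove the cluster and its containers when the job finishes.
k3d_delete() {
  k3d cluster delete "${K3D_CLUSTER_NAME}"
}

# Skip the GitLab Agent context switch when the job runs against local k3d.
set_context() {
  if [ "${K3D_MODE:-false}" = "true" ]; then
    k3d_kubeconfig
    return
  fi
  kubectl config use-context "${AGENT_PROJECT_PATH}:${AGENT_NAME}"
}
```

Keeping the `K3D_MODE` check inside `set_context()` means existing vcluster jobs, which never set the variable, follow the agent path unchanged.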
## Files to Change (PoC only)

| File | Change |
|------|--------|
| `scripts/ci/k3d.sh` | New: cluster lifecycle |
| `scripts/ci/k3d_deploy.sh` | New: deploy wrapper with nip.io URL |
| `.gitlab/ci/k3d-review-apps.gitlab-ci.yml` | New: base k3d CI template |
| `.gitlab/ci/environments/k3d.135.amd64.gitlab-ci.yml` | New: PoC environment config |
| `scripts/ci/autodevops.sh` | `set_context()`: skip agent when `K3D_MODE=true` |
| `.gitlab-ci.yml` | Include new k3d template, add `K3D_MODE` variable |

## Acceptance Criteria (PoC)

- [x] `LIMIT_TO=k3d135` pipeline runs `review_k3d_v135` and completes without timeout
- [x] k3d cluster creates cleanly with external services (Valkey, CNPG, Garage)
- [x] `helm upgrade --install` completes within 900s
- [x] Feature specs reach GitLab via nip.io URL and pass
- [x] Running 3–5 concurrent pipelines shows no resource contention (5 parallel QA workers each get their own isolated cluster)
- [x] p50 job duration is comparable to (or better than) the existing `review_v135` vcluster job

## Related

- Current vcluster scripts: `scripts/ci/vcluster.sh`, `scripts/ci/vcluster_deploy.sh`
- Current environment configs: `.gitlab/ci/environments/vcluster.*.gitlab-ci.yml`
- Current base template: `.gitlab/ci/vcluster-review-apps.gitlab-ci.yml`

---

## Implementation Status

### Phase 1: PoC (!4967) ✅ In review

All acceptance criteria met. See MR !4967 for details.

**Findings during PoC:**

- GitLab 17+ PATs use a routing suffix with dots (e.g. `glpat-xxx.01.yyy`), so the grep pattern extracting the token from `gitlab-rails runner` output must include `.` in the character class.
- Push-over-SSH specs require `--port "22:22@loadbalancer"` on `k3d cluster create`; the NGINX ingress TCP config for port 22 → gitlab-shell is already in the chart defaults.
- `ip route get` field position for the source IP varies with routing topology; parse by `src` keyword, not position.
- The `e2e` runner tag (GitLab internal DinD fleet) is required. `saas-linux-large-amd64` runners do not have DinD pre-configured. The `e2e` fleet runs privileged DinD with sufficient resources for the full GitLab workload.
- The CI/CD variable `GITLAB_QA_ADMIN_ACCESS_TOKEN` is scoped to the shared vcluster instance and returns 401 against a fresh k3d deployment. The QA job mints a fresh admin PAT via `gitlab-rails runner`, exports it as `GITLAB_ADMIN_TOKEN` in `VARIABLES_FILE`, then re-exports it as `GITLAB_QA_ADMIN_ACCESS_TOKEN` so gitlab-qa picks it up. Basic HTTP auth does not work reliably against freshly-deployed instances.

### Planned Follow-up MRs

| # | MR | Branch | Status | Description |
|---|----|--------|--------|-------------|
| 1 | !4967 | `feature/k3d-per-job-review-env` | In review, pipeline ✅ | Phase 1 PoC: k3d v1.35 amd64, NGINX ingress |
| 2 | !4984 | `feature/k3d-envoy-gateway` | Draft | Replace NGINX ingress with Envoy Gateway (Gateway API) |
| 3 | !4982 | `feature/k3d-k8s-matrix` | Draft | Extend matrix: k3d v1.33, v1.34, v1.35 ARM64 |
| 4 | !4983 | `feature/k3d-manual-full-suite` | Draft | Add `qa_k3d_*_manual_full_suite` jobs (parallel: 7, when: manual) |
| 5 | !4985 | `feature/k3d-remove-native` | Draft (WIP) | Remove vcluster/EKS/GKE `trigger_review_*` jobs; keep one GKE nightly (stable channel, no vcluster) |

MRs are sequenced: each is blocked by the previous one (GitLab MR dependencies). Merge order: !4967 → !4984 → !4982 → !4983 → !4985.

MRs 2–5 temporarily target `feature/k3d-per-job-review-env` and will be retargeted to `master` before each merge.

### Notes on "Remove Native Tests" (MR 5)

The current `trigger_review_current` and `trigger_review_secondary` jobs trigger child pipelines for 7 environments (v133, v134, v135, v135a/arm64, flux, gke135, eks134). Once the k3d matrix fully covers the K8s version matrix, these become redundant.

**Keep:** one scheduled nightly pipeline deploying to a real GKE cluster using the GCP stable release channel. This validates against a real cloud provider without any vcluster infrastructure; GCP manages node upgrades automatically.

**Remove:** all vcluster environment configs, the `trigger_review_current`/`trigger_review_secondary` manual gates, and the EKS environment.

### Phase 4: Cost/Benefit Evaluation

Before fully retiring the vcluster/GKE infrastructure, we need to evaluate whether the k3d approach is cost-neutral or better.

**Key question:** the current GKE cluster (`gkevc-ci-cluster`) is a static, always-on cost. The k3d approach trades that for per-minute ephemeral runner cost on large (`e2e`-tagged) instances. The crossover point depends on how many concurrent pipelines are running at any given time.

**What to measure:**

- Average and peak number of concurrent review pipelines per day
- Cost per `e2e` runner-minute (large DinD fleet) vs. monthly GKE cluster cost
- Whether the GKE cluster can be fully decommissioned (no other workloads depending on it) or only right-sized
- Runner compute time added per MR pipeline by the self-provisioning overhead (~10–15 min deploy per job × number of parallel jobs)

**Expected outcome:** the GKE cluster runs 24/7 regardless of pipeline activity, so at low-to-moderate pipeline concurrency the ephemeral model should be cheaper or equivalent. At very high concurrency it may be more expensive, but the reliability improvement justifies the cost in either case.

**Action:** gather 30-day pipeline metrics from GitLab CI analytics and GCP billing, then produce a before/after estimate. This can be done in parallel with the MR merge sequence and does not block !4985.
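
The crossover described in Phase 4 can be estimated with simple arithmetic once the measurements are in. The sketch below is a rough model, not an agreed costing method: both function names are hypothetical, and every number in the usage example is a placeholder rather than a measured value. It assumes the added runner cost scales linearly with pipelines, parallel jobs, and deploy minutes per job.

```bash
#!/usr/bin/env bash
# Sketch only: a linear cost model with placeholder inputs, to be replaced
# by the 30-day figures from GitLab CI analytics and GCP billing.

# Monthly runner cost added by per-job deploys:
#   pipelines/month x parallel jobs x deploy minutes per job x cost per runner-minute
ephemeral_monthly_cost() {
  awk -v p="$1" -v j="$2" -v m="$3" -v c="$4" \
    'BEGIN { printf "%.2f\n", p * j * m * c }'
}

# Runner-minutes per month at which the ephemeral cost equals the
# static monthly cluster cost.
break_even_minutes() {
  awk -v gke="$1" -v c="$2" 'BEGIN { printf "%.0f\n", gke / c }'
}
```

With placeholder inputs of 1000 pipelines per month, 6 parallel jobs, 15 deploy minutes per job, and 0.05 per runner-minute, `ephemeral_monthly_cost 1000 6 15 0.05` prints `4500.00`. For a hypothetical 3000 monthly cluster cost, `break_even_minutes 3000 0.05` prints `60000`, so the ephemeral model would be cheaper below 60000 added runner-minutes per month under these assumptions.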
