Build Orbit Staging Environment (#20991) · Epics · GitLab.org

Build Orbit Staging Environment

## Summary

We need a new GKE staging environment for Orbit. The current `analytics-eventsdot-stg` cluster was originally set up for the Data Insights Platform / Snowplow migration, and Terraform will not let us rename it. Easier to start fresh with the right name than fight the tooling. This was decided at the [Feb 2026 GKG offsite](https://gitlab.com/gitlab-org/orbit/documentation/orbit-artifacts).

Ankit confirmed the GKE bootstrapping modules can be reused and estimated about a week for provisioning if MR reviews are timely.

Priority is high. This blocks all staging infra work and we need staging running for GKG GA (target: end of April 2026 for .com).

## Background

The `analytics-eventsdot-stg` cluster was built for different work (Snowplow migration). Siphon and NATS already live there, but the naming is wrong and the cluster cannot be renamed without tearing it down and reimporting state. Rather than fight that, we are building a new cluster named Orbit that will house GKG, Siphon, and NATS together.

## Scope

### Infrastructure

- [x] GCP project creation via `gl-security/corp/issue-tracker` (this has to happen first, everything else depends on it)
- [x] GKE cluster via Terraform in `config-mgmt` on ops.gitlab.net
- [ ] VPC, Cloud NAT, firewall rules
- [x] Vault integration for secrets
- [ ] ArgoCD bootstrap
- [ ] PSC to staging Patroni (primary + replica)

### Services

- [x] NATS JetStream
- [x] Siphon producer + consumer
- [x] GKG server (webserver, indexer, dispatcher, health-check modes)
- [x] ClickHouse connectivity: use existing CH Cloud staging instance, separate logical DB for GKG

### Access and observability

- [ ] Read-only GKE access for Orbit team members
- [ ] Observability stack (kube-prometheus-stack, Grafana dashboards, logging)
- [ ] Alerting and on-call per [component ownership model](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/production/component-ownership-model/)

## Decisions from offsite

- New GCP project. Not reusing `analytics-eventsdot-stg`.
- GKG gets its own logical ClickHouse DB within the existing CH Cloud staging instance.
- Component ownership model: dev teams are first responders, SRE reviews MRs.
- We will need multiple Siphon instances (main, CI, sec databases).
- Our team writes the Terraform MRs, infra team reviews.

## References

- [GKG offsite notes (Feb 2026)](https://gitlab.com/gitlab-org/orbit/documentation/orbit-artifacts)
- [Knowledge Graph SSOT](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/KNOWLEDGE_GRAPH_SOURCE_OF_TRUTH.md)
- [Prior art: analytics-eventsdot-stg provisioning (#2042)](https://gitlab.com/gitlab-com/gl-security/corp/issue-tracker/-/issues/2042)
- [config-mgmt (Terraform)](https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt)
- [gitlab-helmfiles](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles)
- [GKG Helm Charts](https://gitlab.com/gitlab-org/orbit/gkg-helm-charts)
- [GitLab Knowledge Graph as a Service - GA (#19744)](https://gitlab.com/groups/gitlab-org/-/work_items/19744)
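
The Infrastructure checklist states one hard ordering rule: GCP project creation has to happen first, and everything else depends on it. A minimal sketch of that rule as a dependency graph follows, in case it helps when sequencing the Terraform MRs. The step names are shorthand invented for this example, and the ArgoCD-on-GKE edge is an assumption rather than something stated in this epic.

```python
from graphlib import TopologicalSorter

# Hypothetical shorthand names for the Infrastructure checklist items.
PROJECT = "gcp-project"

# Maps each step to the set of steps that must finish before it.
# From the epic: every step depends on the GCP project.
# Assumption (not from the epic): ArgoCD bootstrap also needs the GKE cluster.
DEPENDENCIES = {
    PROJECT: set(),
    "gke-cluster": {PROJECT},
    "vpc-nat-firewall": {PROJECT},
    "vault-integration": {PROJECT},
    "argocd-bootstrap": {PROJECT, "gke-cluster"},
    "psc-to-patroni": {PROJECT},
}


def provisioning_order(dependencies):
    """Return one valid order in which every step follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = provisioning_order(DEPENDENCIES)
for position, step in enumerate(order, start=1):
    print(f"{position}. {step}")
```

Steps with no edge between them can be worked on in parallel. `TopologicalSorter` raises `CycleError` if a circular dependency is introduced by mistake.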
