Create Orbit Dashboards
## Problem to solve
Four GKG dashboards exist as raw Grafana JSON in the KG repo under [`dashboards/`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/tree/main/dashboards) but are not deployed anywhere. [`README.md#L275-276`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/README.md#L275-276) lists both dev and production Grafana as `TODO`. Operators have no default GKG dashboards on `dashboards.gitlab.net`.
Beyond the existing four, large parts of the telemetry surface defined in [`observability.md`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md) have no dashboard coverage at all: scheduler, code indexing, namespace deletion, schema migration, NATS JetStream consumer health, ClickHouse destination health, query-engine threat signals, and service saturation. We want high-resolution visibility into every subsystem before GA, not just the four views that happened to get built first.
The `orbit` service is already registered in [`gitlab-com/runbooks`](https://gitlab.com/gitlab-com/runbooks) with SLIs for `gkg_webserver` and `gkg_indexer` ([`metrics-catalog/services/orbit.jsonnet`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/orbit.jsonnet)), but the only file under [`dashboards/orbit/`](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/orbit) is an auto-generated 2-line service-overview stub.
<details>
<summary>old proposals</summary>
## Proposed solution
Build a comprehensive dashboard set in `gitlab-com/runbooks` under `dashboards/orbit/`. Two streams of work:
1. Port the four existing JSON dashboards to Jsonnet.
2. Add new dashboards for subsystems that have metrics today but no view.
Feature teams own `dashboards/<team>/` directly. `CODEOWNERS` has no entry for `/dashboards/`, so pure dashboard MRs need only intra-team review, not SRE approval.
### 1. Dashboards to port
| Source (KG repo) | Target (runbooks) | Content |
|---|---|---|
| `dashboards/gkg-overview.json` | `dashboards/orbit/main.dashboard.jsonnet` (replace stub) | Request rate/latency per route, service logs |
| `dashboards/etl-engine.json` | `dashboards/orbit/etl-engine.dashboard.jsonnet` | Throughput, E2E + handler latency, worker pool, NATS fetch, ClickHouse writes |
| `dashboards/query-pipeline.json` | `dashboards/orbit/query-pipeline.dashboard.jsonnet` | Pipeline stages (compile/execute/auth/hydration), CH rows/bytes/memory, content resolution, errors |
| `dashboards/sdlc-indexing.json` | `dashboards/orbit/sdlc-indexing.dashboard.jsonnet` | Per-entity throughput, watermark lag, datalake + transform latency, error kinds |
### 2. Dashboards to create
Proposed net-new dashboards, each backed by metrics the GKG code already emits. This list is a starting set, not a ceiling; add more as gaps surface during implementation.
| New dashboard | Purpose | Backing metrics (prefix) |
|---|---|---|
| `dashboards/orbit/saturation.dashboard.jsonnet` | CPU, memory, FD, goroutine/tokio-task, pod restarts, OOMKills for `gkg-webserver` and `gkg-indexer` | Standard kube-state + process metrics, same pattern as other services |
| `dashboards/orbit/scheduler.dashboard.jsonnet` | ETL scheduler queue health, job cadence, backlog | `gkg_scheduler_*` (`crates/indexer/src/scheduler/metrics.rs`) |
| `dashboards/orbit/code-indexing.dashboard.jsonnet` | Code-indexing subsystem: repo throughput, per-language latency, errors | `gkg_indexer_code_*` (`crates/indexer/src/modules/code/metrics.rs`) |
| `dashboards/orbit/namespace-deletion.dashboard.jsonnet` | Namespace deletion throughput, failures, lag | `gkg_indexer_namespace_deletion_*` (`crates/indexer/src/modules/namespace_deletion/metrics.rs`) |
| `dashboards/orbit/schema-migration.dashboard.jsonnet` | Schema migration runs, duration, failures | `gkg_schema_migration_*` (`crates/indexer/src/metrics.rs`) |
| `dashboards/orbit/nats-jetstream.dashboard.jsonnet` | Consumer lag, ack/nack ratios, stream depth, delivery retries | `gkg_etl_nats_*`, `gkg_etl_messages_processed_total`, plus upstream NATS exporter metrics if available |
| `dashboards/orbit/clickhouse-destination.dashboard.jsonnet` | Write latency, rows/bytes written by table, retry/failure rate, connection pool | `gkg_etl_destination_*`, `gkg_query_pipeline_ch_*` |
| `dashboards/orbit/content-resolution.dashboard.jsonnet` | Gitaly content fetch latency, batch size, outcome split, blob size distribution | `gkg_content_*` (`crates/gkg-server/src/content/metrics.rs`) |
| `dashboards/orbit/query-engine-threat.dashboard.jsonnet` | Security rejection signals per threat class (feeds the existing alert `GKGAuthFilterMissing`) | `gkg_query_engine_threat_*` (`crates/query-engine/compiler/src/metrics.rs`) |
| `dashboards/orbit/error-budget.dashboard.jsonnet` | SLO burn-rate panels per SLI, error budget remaining | Derived from `orbit.jsonnet` SLIs (may already be auto-generated under `dashboards/stage-groups/knowledge_graph.dashboard.jsonnet`; dedupe before adding) |
| `dashboards/orbit/cost-signals.dashboard.jsonnet` | ClickHouse rows read, bytes read, memory-per-query, destination rows written (proxies for cloud spend) | `gkg_query_pipeline_ch_read_*`, `gkg_etl_destination_rows_written_total` |
### Discovery phase (before writing Jsonnet)
Before implementation, audit every meter registered in the KG repo and map each metric to at least one panel. Source of truth: the eight `metrics.rs` files under `crates/` and the full catalog in [`observability.md`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md). Any metric with no panel is either a gap to dashboard or a metric to retire.
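
The audit described above could be partly automated. The following is a minimal Python sketch, not an existing tool: it assumes metric names appear in the `metrics.rs` files as plain string literals starting with `gkg_`, and that a dashboard "covers" a metric if the name appears anywhere in the dashboard source. Both function names are hypothetical, and the substring match is deliberately loose so that suffixed series such as `_bucket` or `_count` still count as coverage.

```python
import re

# Assumption: metric names are declared as string literals such as "gkg_..." in Rust.
METRIC_NAME = re.compile(r'"(gkg_[a-z0-9_]+)"')


def emitted_metrics(rust_sources):
    """Collect metric names from a {path: source_text} mapping of metrics.rs files.

    Returns {metric_name: first_path_that_declares_it}.
    """
    found = {}
    for path, text in rust_sources.items():
        for name in METRIC_NAME.findall(text):
            found.setdefault(name, path)
    return found


def uncovered_metrics(emitted, dashboard_sources):
    """Return the metrics that no dashboard file mentions.

    dashboard_sources is a {path: source_text} mapping of Jsonnet or JSON files.
    Substring matching can report false coverage when one metric name is a
    prefix of another, so treat an empty result as a hint, not as proof.
    """
    haystack = "\n".join(dashboard_sources.values())
    return {name: path for name, path in emitted.items() if name not in haystack}
```

Each entry returned by `uncovered_metrics` is either a gap to dashboard or a candidate metric to retire, matching the rule stated above.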
### Implementation notes
- Template to copy: [`dashboards/duo-workflow-svc/errors-breakdown.dashboard.jsonnet`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/duo-workflow-svc/errors-breakdown.dashboard.jsonnet). It is small and uses the same Mimir tenant pattern.
- Datasource: `mimirHelper.mimirDatasource('analytics-eventsdot')` (orbit's tenant).
- Job labels preserved: `job="gkg-webserver"`, `job="gkg-indexer"`. Confirm the scrape config matches before merge.
- Templating: every dashboard should expose at minimum a `$env`/`$stage` picker if the tenant is multi-environment. Preserve `$query_type` (query-pipeline) and `$entity` (sdlc-indexing) from the originals.
- Loki panels in the overview rely on `{component="gkg-..."}`. GitLab.com uses Elasticsearch for logs ([`observability.md#L11-12`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md#L11-12)), so re-point those panels or drop them and link out.
- Every dashboard links back to `main.dashboard.jsonnet` via the dashboard header so operators can navigate between views.
- Commit any files regenerated by `make generate` (CI enforces this).
- Conventional Commits: `feat(orbit): ...`.
- Land dashboards incrementally in small MRs rather than one giant MR; easier to review and faster to get panels into production.
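
To scope the Loki note above before porting, it may help to list which panels in the source JSON use a Loki datasource. The sketch below is a heuristic under stated assumptions: it relies on the general Grafana dashboard JSON shape (`panels`, nested `panels` under rows, `targets`, and a `datasource` that is either an object with a `type` key or a bare string), and it treats any datasource whose type or name contains "loki" as a Loki panel. The function names are hypothetical.

```python
import json


def load_dashboard(path):
    """Read an exported Grafana dashboard JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_panels(panels):
    """Yield every panel, descending into row panels that nest their children."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def datasource_type(ds):
    """Normalise a datasource reference to a string for matching."""
    # Exports use either {"type": ..., "uid": ...} or a bare name string.
    if isinstance(ds, dict):
        return ds.get("type") or ""
    return ds or ""


def loki_panels(dashboard):
    """Return titles of panels whose panel-level or target-level datasource looks like Loki."""
    titles = []
    for panel in iter_panels(dashboard.get("panels", [])):
        sources = [panel.get("datasource")]
        sources += [target.get("datasource") for target in panel.get("targets", [])]
        if any("loki" in datasource_type(ds).lower() for ds in sources):
            titles.append(panel.get("title", "<untitled>"))
    return titles
```

Running `loki_panels(load_dashboard(...))` against each of the four source files would give the list of panels to re-point or replace with a link.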
### Acceptance criteria
- [ ] Four ported Jsonnet dashboards land in `dashboards/orbit/` with the same panel coverage as the JSON sources.
- [ ] New dashboards listed above land in `dashboards/orbit/` with panels for every metric in their backing prefix.
- [ ] Every metric emitted by the GKG code has at least one panel somewhere in `dashboards/orbit/`, or a documented reason for omission.
- [ ] `make test` passes and `cd dashboards && ./test-dashboard.sh orbit/<name>.dashboard.jsonnet` renders locally for each dashboard.
- [ ] MRs merged to `gitlab-com/runbooks:master` with `deploy-dashboards` CI job green.
- [ ] Dashboards reachable on `dashboards.gitlab.net` and cross-linked from `main.dashboard.jsonnet`.
- [ ] JSON sources in KG repo removed, or replaced with a one-line pointer to the runbooks location, to prevent drift.
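
The first criterion ("same panel coverage as the JSON sources") could be spot-checked by comparing panel titles. This is a rough sketch with hypothetical function names: it assumes both the source dashboard and the rendered output of the ported Jsonnet are available as parsed JSON dicts, and it uses title equality as a proxy for coverage, so a deliberately renamed panel will be reported as missing.

```python
def panel_titles(dashboard):
    """Collect titles of all non-row panels, including panels nested under rows."""
    titles = set()
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        # Row panels may carry their children in a nested "panels" list.
        stack.extend(panel.get("panels", []))
        if panel.get("type") != "row":
            titles.add(panel.get("title", ""))
    return titles


def missing_panels(source, ported):
    """Titles present in the source JSON dashboard but absent from the ported one."""
    return sorted(panel_titles(source) - panel_titles(ported))
```

An empty result from `missing_panels` suggests title-level parity only; queries and units still need a manual review.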
### Out of scope
- Operational runbook documentation under `docs/orbit/` (tracked in #493).
- New SLIs or alert rules. Edits to `metrics-catalog/services/orbit.jsonnet` regenerate `mimir-rules/*.yml`, which is CODEOWNER-gated to `@gitlab-org/scalability/observability`. File a separate issue if we want additional SLOs backed by the new dashboards.
- Adding Prometheus metrics not already emitted. If discovery surfaces a gap where no metric exists for a behavior we want to see, file a follow-up in the KG repo to instrument it.
<details>
<summary>Context on ownership and precedent</summary>
Recent precedent for pure dashboard MRs merging with intra-team review and no SRE approval: [MR 10448](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10448) (Runner), [MR 10462](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10462) (Pipeline Authoring), [MR 10475](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10475) (DX).
SRE approval was triggered only when metrics-catalog edits produced `mimir-rules/` diffs: [MR 10279](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10279), [MR 10343](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10343), [MR 10391](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10391).
Authoring guide: [`dashboards/AGENTS.md`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/AGENTS.md). Source of truth for the metric catalog: [`observability.md`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md).
Metric emission by file:
- `crates/indexer/src/metrics.rs`: `gkg_etl_*`, `gkg_schema_migration_*`
- `crates/indexer/src/scheduler/metrics.rs`: `gkg_scheduler_*`
- `crates/indexer/src/modules/sdlc/metrics.rs`: `gkg_indexer_sdlc_*`
- `crates/indexer/src/modules/code/metrics.rs`: `gkg_indexer_code_*`
- `crates/indexer/src/modules/namespace_deletion/metrics.rs`: `gkg_indexer_namespace_deletion_*`
- `crates/gkg-server/src/pipeline/metrics.rs`: `gkg_query_pipeline_*`
- `crates/gkg-server/src/content/metrics.rs`: `gkg_content_*`
- `crates/query-engine/compiler/src/metrics.rs`: `gkg_query_engine_*`
</details>
Related: #493 (Orbit operational runbooks), parent epic &20992.
</details>