Create Orbit Dashboards
## Problem to solve

Four GKG dashboards exist as raw Grafana JSON in the KG repo under [`dashboards/`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/tree/main/dashboards) but are not deployed anywhere. [`README.md#L275-276`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/README.md#L275-276) lists both dev and production Grafana as `TODO`. Operators have no default GKG dashboards on `dashboards.gitlab.net`.

Beyond the existing four, large parts of the telemetry surface defined in [`observability.md`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md) have no dashboard coverage at all: scheduler, code indexing, namespace deletion, schema migration, NATS JetStream consumer health, ClickHouse destination health, query-engine threat signals, and service saturation. We want high-resolution visibility into every subsystem before GA, not just the four views that happened to get built first.

The `orbit` service is already registered in [`gitlab-com/runbooks`](https://gitlab.com/gitlab-com/runbooks) with SLIs for `gkg_webserver` and `gkg_indexer` ([`metrics-catalog/services/orbit.jsonnet`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/orbit.jsonnet)), but the only file under [`dashboards/orbit/`](https://gitlab.com/gitlab-com/runbooks/-/tree/master/dashboards/orbit) is an auto-generated 2-line service-overview stub.

<details>
<summary>old proposals</summary>

## Proposed solution

Build a comprehensive dashboard set in `gitlab-com/runbooks` under `dashboards/orbit/`. Two streams of work:

1. Port the four existing JSON dashboards to Jsonnet.
2. Add new dashboards for subsystems that have metrics today but no view.

Feature teams own `dashboards/<team>/` directly. `CODEOWNERS` has no entry for `/dashboards/`, so pure dashboard MRs need intra-team review only, not SRE.

### 1. Dashboards to port

| Source (KG repo) | Target (runbooks) | Content |
|---|---|---|
| `dashboards/gkg-overview.json` | `dashboards/orbit/main.dashboard.jsonnet` (replace stub) | Request rate/latency per route, service logs |
| `dashboards/etl-engine.json` | `dashboards/orbit/etl-engine.dashboard.jsonnet` | Throughput, E2E + handler latency, worker pool, NATS fetch, ClickHouse writes |
| `dashboards/query-pipeline.json` | `dashboards/orbit/query-pipeline.dashboard.jsonnet` | Pipeline stages (compile/execute/auth/hydration), CH rows/bytes/memory, content resolution, errors |
| `dashboards/sdlc-indexing.json` | `dashboards/orbit/sdlc-indexing.dashboard.jsonnet` | Per-entity throughput, watermark lag, datalake + transform latency, error kinds |

### 2. Dashboards to create

Proposed net-new dashboards, each backed by metrics already emitted by the GKG code. List is the starting set, not the ceiling. We should add more as gaps surface during implementation.

| New dashboard | Purpose | Backing metrics (prefix) |
|---|---|---|
| `dashboards/orbit/saturation.dashboard.jsonnet` | CPU, memory, FD, goroutine/tokio-task, pod restarts, OOMKills for `gkg-webserver` and `gkg-indexer` | Standard kube-state + process metrics, same pattern as other services |
| `dashboards/orbit/scheduler.dashboard.jsonnet` | ETL scheduler queue health, job cadence, backlog | `gkg_scheduler_*` (`crates/indexer/src/scheduler/metrics.rs`) |
| `dashboards/orbit/code-indexing.dashboard.jsonnet` | Code-indexing subsystem: repo throughput, per-language latency, errors | `gkg_indexer_code_*` (`crates/indexer/src/modules/code/metrics.rs`) |
| `dashboards/orbit/namespace-deletion.dashboard.jsonnet` | Namespace deletion throughput, failures, lag | `gkg_indexer_namespace_deletion_*` (`crates/indexer/src/modules/namespace_deletion/metrics.rs`) |
| `dashboards/orbit/schema-migration.dashboard.jsonnet` | Schema migration runs, duration, failures | `gkg_schema_migration_*` (`crates/indexer/src/metrics.rs`) |
| `dashboards/orbit/nats-jetstream.dashboard.jsonnet` | Consumer lag, ack/nack ratios, stream depth, delivery retries | `gkg_etl_nats_*`, `gkg_etl_messages_processed_total`, plus upstream NATS exporter metrics if available |
| `dashboards/orbit/clickhouse-destination.dashboard.jsonnet` | Write latency, rows/bytes written by table, retry/failure rate, connection pool | `gkg_etl_destination_*`, `gkg_query_pipeline_ch_*` |
| `dashboards/orbit/content-resolution.dashboard.jsonnet` | Gitaly content fetch latency, batch size, outcome split, blob size distribution | `gkg_content_*` (`crates/gkg-server/src/content/metrics.rs`) |
| `dashboards/orbit/query-engine-threat.dashboard.jsonnet` | Security rejection signals per threat class (fuels existing alert `GKGAuthFilterMissing`) | `gkg_query_engine_threat_*` (`crates/query-engine/compiler/src/metrics.rs`) |
| `dashboards/orbit/error-budget.dashboard.jsonnet` | SLO burn-rate panels per SLI, error budget remaining | Derived from `orbit.jsonnet` SLIs (may already be auto-generated under `dashboards/stage-groups/knowledge_graph.dashboard.jsonnet`; dedupe before adding) |
| `dashboards/orbit/cost-signals.dashboard.jsonnet` | ClickHouse rows read, bytes read, memory-per-query, destination rows written (proxies for cloud spend) | `gkg_query_pipeline_ch_read_*`, `gkg_etl_destination_rows_written_total` |

### Discovery phase (before writing Jsonnet)

Before implementation, audit every meter registered in the KG repo and map each metric to at least one panel. Source of truth: the eight `metrics.rs` files under `crates/` and the full catalog in [`observability.md`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md). Any metric with no panel is either a gap to dashboard or a metric to retire.
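
The audit described in the discovery phase could be partly automated. The following is a minimal sketch, not an existing tool in either repo: it assumes the dashboards are available as rendered Grafana JSON (dicts with `panels[].targets[].expr`, the usual shape for Prometheus targets) and that the metric catalog has already been extracted into a plain list of names. The function names and the `gkg_scheduler_example_duration_seconds` metric are hypothetical and used only for illustration.

```python
import re
from typing import Iterable, Iterator

# Matches GKG metric names inside a PromQL expression. Job labels such as
# "gkg-indexer" use a hyphen after "gkg" and therefore do not match.
METRIC_RE = re.compile(r"\bgkg_[a-z0-9_]+")

# Prometheus histograms are queried through these suffixed series, while a
# catalog usually lists the base name, so both forms count as covered.
HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def panel_expressions(dashboard: dict) -> Iterator[str]:
    """Yield every target expression, including panels nested inside rows."""
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        stack.extend(panel.get("panels", []))  # collapsed rows nest their panels
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if expr:
                yield expr


def referenced_metrics(dashboards: Iterable[dict]) -> set[str]:
    """Collect all gkg_* metric names referenced by any panel query."""
    found: set[str] = set()
    for dashboard in dashboards:
        for expr in panel_expressions(dashboard):
            for name in METRIC_RE.findall(expr):
                found.add(name)
                for suffix in HISTOGRAM_SUFFIXES:
                    if name.endswith(suffix):
                        found.add(name[: -len(suffix)])
    return found


def uncovered_metrics(catalog: Iterable[str], dashboards: Iterable[dict]) -> list[str]:
    """Return catalog metrics that no panel queries: gaps to dashboard or retire."""
    return sorted(set(catalog) - referenced_metrics(dashboards))


if __name__ == "__main__":
    # Hypothetical catalog and dashboard, for illustration only.
    catalog = [
        "gkg_etl_messages_processed_total",
        "gkg_etl_destination_rows_written_total",
        "gkg_scheduler_example_duration_seconds",
    ]
    dashboard = {
        "panels": [
            {"targets": [{"expr": 'sum(rate(gkg_etl_messages_processed_total{job="gkg-indexer"}[5m]))'}]},
            {
                "type": "row",
                "panels": [
                    {"targets": [{"expr": "histogram_quantile(0.95, sum by (le) (rate(gkg_scheduler_example_duration_seconds_bucket[5m])))"}]}
                ],
            },
        ]
    }
    for name in uncovered_metrics(catalog, [dashboard]):
        print(f"no panel for {name}")
```

In practice the dashboard dicts would come from `json.load` on whatever the Jsonnet sources render to; how that rendering is invoked is left out because it depends on the runbooks tooling. A regex scan like this only proves that a metric name appears in some query, so the result is a starting list for manual review rather than a definitive coverage report.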
### Implementation notes

- Template to copy: [`dashboards/duo-workflow-svc/errors-breakdown.dashboard.jsonnet`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/duo-workflow-svc/errors-breakdown.dashboard.jsonnet). It is small and uses the same Mimir tenant pattern.
- Datasource: `mimirHelper.mimirDatasource('analytics-eventsdot')` (orbit's tenant).
- Job labels preserved: `job="gkg-webserver"`, `job="gkg-indexer"`. Confirm the scrape config matches before merge.
- Templating: every dashboard should expose at minimum a `$env`/`$stage` picker if the tenant is multi-environment. Preserve `$query_type` (query-pipeline) and `$entity` (sdlc-indexing) from the originals.
- Loki panels in the overview rely on `{component="gkg-..."}`. GitLab.com uses Elasticsearch for logs ([`observability.md#L11-12`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md#L11-12)), so re-point those panels or drop them and link out.
- Every dashboard links back to `main.dashboard.jsonnet` via the dashboard header so operators can navigate between views.
- Commit any files regenerated by `make generate` (CI enforces this).
- Conventional Commits: `feat(orbit): ...`.
- Land dashboards incrementally in small MRs rather than one giant MR; easier to review and faster to get panels into production.

### Acceptance criteria

- [ ] Four ported Jsonnet dashboards land in `dashboards/orbit/` with the same panel coverage as the JSON sources.
- [ ] New dashboards listed above land in `dashboards/orbit/` with panels for every metric in their backing prefix.
- [ ] Every metric emitted by the GKG code has at least one panel somewhere in `dashboards/orbit/`, or a documented reason for omission.
- [ ] `make test` passes and `cd dashboards && ./test-dashboard.sh orbit/<name>.dashboard.jsonnet` renders locally for each dashboard.
- [ ] MRs merged to `gitlab-com/runbooks:master` with `deploy-dashboards` CI job green.
- [ ] Dashboards reachable on `dashboards.gitlab.net` and cross-linked from `main.dashboard.jsonnet`.
- [ ] JSON sources in KG repo removed, or replaced with a one-line pointer to the runbooks location, to prevent drift.

### Out of scope

- Operational runbook documentation under `docs/orbit/` (tracked in #493).
- New SLIs or alert rules. Edits to `metrics-catalog/services/orbit.jsonnet` regenerate `mimir-rules/*.yml`, which is CODEOWNER-gated to `@gitlab-org/scalability/observability`. File a separate issue if we want additional SLOs backed by the new dashboards.
- Adding Prometheus metrics not already emitted. If discovery surfaces a gap where no metric exists for a behavior we want to see, file a follow-up in the KG repo to instrument it.

<details>
<summary>Context on ownership and precedent</summary>

Recent precedent that pure dashboard MRs merge intra-team without SRE: [MR 10448](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10448) (Runner), [MR 10462](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10462) (Pipeline Authoring), [MR 10475](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10475) (DX).

SRE approval only triggered when metrics-catalog edits produced `mimir-rules/` diffs: [MR 10279](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10279), [MR 10343](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10343), [MR 10391](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/10391).

Authoring guide: [`dashboards/AGENTS.md`](https://gitlab.com/gitlab-com/runbooks/-/blob/master/dashboards/AGENTS.md).

Metric catalog of truth: [`observability.md`](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/main/docs/design-documents/observability.md).

Metric emission by file:

- `crates/indexer/src/metrics.rs`: `gkg_etl_*`, `gkg_schema_migration_*`
- `crates/indexer/src/scheduler/metrics.rs`: `gkg_scheduler_*`
- `crates/indexer/src/modules/sdlc/metrics.rs`: `gkg_indexer_sdlc_*`
- `crates/indexer/src/modules/code/metrics.rs`: `gkg_indexer_code_*`
- `crates/indexer/src/modules/namespace_deletion/metrics.rs`: `gkg_indexer_namespace_deletion_*`
- `crates/gkg-server/src/pipeline/metrics.rs`: `gkg_query_pipeline_*`
- `crates/gkg-server/src/content/metrics.rs`: `gkg_content_*`
- `crates/query-engine/compiler/src/metrics.rs`: `gkg_query_engine_*`

</details>

Related: #493 (Orbit operational runbooks), parent epic &20992.

<!-- AI-Sessions dir: ~/.claude/projects/-Users-angelo-rivera-gitlab-runbooks/ 6732ce81-f7ac-4af9-b97a-63db74a599c2.jsonl (2026-04-21) -->

</details>
issue