[indexer] Support getting namespace indexing status information
## Problem

Admins and automated systems have no lightweight way to check indexing status. Basic questions require manual log reading or direct ClickHouse queries:

- Is indexing running for this namespace?
- How far along is initial backfill after enabling Knowledge Graph?
- What entity counts are indexed under a given group or project?
- When was code last indexed for a specific project?
- Are we safe to run query assertions in an e2e test?

## Requirements

### R1: Namespace-scoped indexing progress

The system must expose per-namespace indexing progress including:

- Overall indexing state (pending, backfilling, incremental, completed)
- Per-entity-type status (pending, in_progress, completed) and counts
- Per-entity-type edge counts
- Code indexing progress (projects indexed vs total, per-branch commit status)
- Watermark information to distinguish backfill from incremental indexing

### R2: Hierarchy-aware count aggregation

Not every user has access to the top-level group. The system must support returning counts scoped to an arbitrary traversal path prefix (any group or subgroup), not just the root namespace. Users with access to multiple disjoint subtrees (e.g., `1/9970/200/` and `1/9970/300/` but not `1/9970/`) must be able to get counts for each subtree independently.

### R3: Project-level code graph lookup

It must be possible to look up code indexing status by project ID, including:

- Which branches have been indexed
- Last indexed commit per branch
- Entity counts (File, Directory, Definition, ImportedSymbol)
- Code edge counts (CONTAINS, DEFINES, IMPORTS, CALLS)

### R4: Edge count tracking

Edge counts must be available at both the namespace level (all edge types from `gl_edge`, covering SDLC and code edges) and the project level (code-specific edges).

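
As an illustration of the hierarchy-aware aggregation in R2, one possible approach is to roll each count up to every ancestor prefix of its traversal path, so that any group or subgroup prefix can be answered on its own. This is a minimal sketch under stated assumptions: `ancestor_prefixes`, `rollup_counts`, and the sample rows are hypothetical and are not taken from the Knowledge Graph codebase.

```python
from collections import Counter, defaultdict


def ancestor_prefixes(traversal_path: str) -> list[str]:
    """Return every prefix of a traversal path, shortest first.

    Example: '1/9970/200/' -> ['1/', '1/9970/', '1/9970/200/']
    """
    segments = [s for s in traversal_path.split("/") if s]
    return ["/".join(segments[: i + 1]) + "/" for i in range(len(segments))]


def rollup_counts(rows):
    """Aggregate (traversal_path, entity_type, count) rows at every prefix."""
    totals = defaultdict(Counter)
    for path, entity_type, count in rows:
        for prefix in ancestor_prefixes(path):
            totals[prefix][entity_type] += count
    return totals


# Hypothetical sample data: two disjoint subtrees under the same parent group.
rows = [
    ("1/9970/200/", "File", 5),
    ("1/9970/200/", "Definition", 2),
    ("1/9970/300/", "File", 7),
]
totals = rollup_counts(rows)

print(dict(totals["1/9970/200/"]))  # {'File': 5, 'Definition': 2}
print(dict(totals["1/9970/300/"]))  # {'File': 7}
print(dict(totals["1/9970/"]))      # {'File': 12, 'Definition': 2}
```

A user with access to only `1/9970/200/` and `1/9970/300/` would read those two prefixes separately and never needs the `1/9970/` total.
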
### R5: Backfill and initial indexing states

The system must clearly distinguish:

- Namespace enabled but not yet picked up by the indexer
- First-ever ETL run in progress (backfilling from epoch)
- At least one full pass completed (incremental mode)
- Code backfill waiting on Project plan completion

A consumer must be able to determine "is initial indexing fully complete?" without interpreting checkpoint internals.

### R6: Staleness and consistency

- Progress data may be eventually consistent (bounded by the ETL interval)
- Each progress record must include an `updated_at` timestamp so consumers can detect staleness
- Progress reporting failures must not block the indexing pipeline
- The system must self-heal after transient failures (next successful ETL run recovers state)

### R7: E2E testability

The endpoint must serve as a reliable "indexing complete" signal for automated e2e test harnesses. The test flow:

1. Spin up the full GKG stack (indexer + webserver + ClickHouse + NATS) alongside Rails
2. Enable a namespace, insert seed data
3. Poll the progress endpoint until indexing is complete
4. Execute query assertions

This requires:

- A deterministic "all done" condition that can be checked programmatically
- Fast reads (polling should not add load to ClickHouse)
- Clear state transitions that tests can assert against

### R8: Ontology-driven response shape

The response must be driven by the ontology at runtime. Adding a new entity type to the ontology should automatically surface it in the progress response. Domains should mirror the existing `GetGraphSchema` / `GetGraphStats` grouping.

## Design

See [ADR 009: Indexing progress via NATS KV](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/blob/arivera/009-indexing-progress-nats-kv/docs/design-documents/decisions/009_indexing_progress_nats_kv.md) for the proposed architecture.

**Key decisions:**

- Indexer writes progress to NATS KV as a side-effect of ETL
- Webserver reads from NATS KV (no ClickHouse on the read path)
- Pre-aggregated counts at every group-level prefix for O(1) reads at any hierarchy level
- Per-project code graph keys for direct project ID lookup
- `meta` keys track pipeline state and watermarks for backfill/incremental distinction
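
To illustrate how an e2e harness might use the progress endpoint for the R7 test flow, the sketch below polls a progress reader until a deterministic "all done" condition holds. Everything in it is an assumption made for illustration: the `fetch_progress` callable, the response fields (`state`, `entities`, `status`), and the choice to treat both `incremental` and `completed` as done are hypothetical. The actual response shape is defined by ADR 009 and the ontology.

```python
import time

# Assumption: either state means at least one full pass has finished (see R5).
DONE_STATES = {"incremental", "completed"}


def is_indexing_complete(progress: dict) -> bool:
    """Deterministic 'all done' check over a hypothetical progress payload."""
    entities = progress.get("entities", {})
    return progress.get("state") in DONE_STATES and all(
        entity.get("status") == "completed" for entity in entities.values()
    )


def wait_for_indexing(fetch_progress, timeout_s=60.0, interval_s=1.0):
    """Poll fetch_progress() until indexing is complete or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        progress = fetch_progress()
        if is_indexing_complete(progress):
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"indexing not complete, last state: {progress.get('state')!r}"
            )
        time.sleep(interval_s)


# Stand-in for the real endpoint: a canned sequence of state transitions.
responses = iter(
    [
        {"state": "pending", "entities": {}},
        {"state": "backfilling", "entities": {"File": {"status": "in_progress"}}},
        {"state": "incremental", "entities": {"File": {"status": "completed"}}},
    ]
)
result = wait_for_indexing(lambda: next(responses), interval_s=0.0)
print(result["state"])  # incremental
```

In a real harness, `fetch_progress` would wrap a request to the progress endpoint, and query assertions would run only after `wait_for_indexing` returns.
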