Harden code indexer for scale, stack safety, and observability (#449) · Issues · GitLab.org / orbit / GitLab Knowledge Graph

## Summary

The code indexer and `code-graph` pipeline need a dedicated production-hardening pass before large-scale rollout. The current implementation has correctness risks under concurrency, memory and disk amplification risks on large repositories, uneven guardrails for pathological inputs, and limited observability for queueing, fallback paths, and resource exhaustion.

This issue tracks the general scalability and resilience work for the code indexing path.

## Key concerns

### Correctness under concurrency

- Newer branch commits can be dropped while an older indexing task is still in flight because dispatch currently coalesces by subject in a way that can discard the newest work.
- Branch locking is time-based and not renewed for long-running indexing jobs, which creates room for duplicate indexing or redelivery on slow repositories.

### Memory, CPU, and write amplification

- The indexing pipeline still materializes large amounts of repository state in memory before writing.
- Per-repo concurrency scales with host CPU count and is not coordinated against a global per-pod memory or IO budget.
- Large diffs or oversized blobs can force fallback to full archive download, which amplifies network, disk, and CPU load exactly when repositories are already large.

### Disk and repository-shape guardrails

- Archive extraction and repository caching do not have clear byte, file-count, or eviction budgets.
- Large repositories and long-lived indexer pods can exhaust ephemeral storage.
- Hard limits exist in several places, but they are not consistently operator-visible or surfaced as durable failure reasons.

### Parser and linker safety

- Python had uncovered recursive hot paths in parser/reference resolution. Those are being addressed separately.
- Ruby should be double-checked next for stack-safety because it still has recursive expression extraction and visitor-driven AST traversal without explicit stack guards in our code paths.
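To make the stack-safety concern concrete, here is a minimal sketch of one common hardening technique: replacing recursive AST traversal with an explicit heap-allocated stack plus a configurable depth budget. It is illustrative only and is not the indexer's actual code; `Node`, `walk`, `DepthBudgetExceeded`, `deep_chain`, and the `max_depth` default are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node; not the indexer's real type."""
    kind: str
    children: list["Node"] = field(default_factory=list)


class DepthBudgetExceeded(Exception):
    """Raised when nesting exceeds the configured depth budget."""


def walk(root: Node, max_depth: int = 512):
    """Yield (depth, node) pairs in pre-order without using recursion."""
    stack = [(0, root)]  # explicit stack lives on the heap, not the call stack
    while stack:
        depth, node = stack.pop()
        if depth > max_depth:
            # Fail explicitly instead of overflowing or degrading silently.
            raise DepthBudgetExceeded(f"depth {depth} exceeds budget {max_depth}")
        yield depth, node
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((depth + 1, child))


def deep_chain(n: int) -> Node:
    """Build a degenerate tree of n nested nodes, iteratively."""
    root = Node("expr")
    current = root
    for _ in range(n - 1):
        child = Node("expr")
        current.children.append(child)
        current = child
    return root
```

With this shape, a pathological input such as `deep_chain(10_000)` either traverses completely or raises `DepthBudgetExceeded`, which could be mapped to a durable failure reason rather than crashing the worker.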
### Observability gaps

- The code indexing path needs stage-level tracing, richer metrics, structured outcome logs, and freshness/lag dashboards.
- Current metrics are not enough to answer basic production questions like: why did a repo fall back to full download, how large was the extracted archive, which stage used the most time, or how stale a branch index is relative to HEAD.

## Proposed scope

- Make task dispatch coalesce to the latest desired commit instead of dropping newer work.
- Replace fixed branch locks with renewable leases and in-progress heartbeats.
- Add explicit resource budgets for archive bytes, extracted bytes, extracted files, cache size, blob size, diff size, and per-repo concurrency.
- Refactor large writes toward chunked or budgeted emission instead of full-repo materialization.
- Harden parser/linker recursion for remaining languages that still need it.
- Add end-to-end observability for queue lag, freshness lag, fallback reasons, resource usage, and per-stage latency.

## Acceptance criteria

- [ ] Latest-commit indexing semantics are defined and implemented for hot branches.
- [ ] Long-running code indexing jobs renew their branch lease and ack progress while work is in flight.
- [ ] Code indexing enforces configurable memory, disk, archive, and concurrency budgets.
- [ ] Oversized repositories or inputs fail explicitly with durable reason codes instead of silent degradation.
- [ ] Remaining unguarded recursive parser/linker paths are either hardened or explicitly documented and tested.
- [ ] Code indexing emits stage-level traces, per-stage latency histograms, fallback counters, resource-usage metrics, and structured task summaries.
- [ ] Runbooks and dashboards exist for freshness lag, queue lag, resource exhaustion, and repeated fallback patterns.
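One possible reading of "coalesce to the latest desired commit" can be sketched as a per-branch slot that holds at most one in-flight commit and one pending commit, where newer submissions overwrite the pending one instead of being dropped. This is a sketch under assumptions, not the dispatcher's real design: `BranchSlot` and `LatestCommitDispatcher` are hypothetical names, submissions are assumed to arrive in commit order, and the code is single-threaded (a real implementation would need a lock or an atomic store shared across pods).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BranchSlot:
    in_flight: Optional[str] = None  # commit currently being indexed
    pending: Optional[str] = None    # newest commit seen while busy


class LatestCommitDispatcher:
    """Hypothetical in-memory dispatcher keyed by branch name."""

    def __init__(self) -> None:
        self._slots: dict[str, BranchSlot] = {}

    def submit(self, branch: str, commit: str) -> Optional[str]:
        """Return the commit to start indexing now, or None if it was queued."""
        slot = self._slots.setdefault(branch, BranchSlot())
        if slot.in_flight is None:
            slot.in_flight = commit
            return commit
        # Busy: replace any older pending commit with the newest one.
        slot.pending = commit
        return None

    def complete(self, branch: str) -> Optional[str]:
        """Finish the in-flight task and return the next commit, if any."""
        slot = self._slots[branch]
        # Promote the pending commit (possibly None) to in-flight.
        slot.in_flight, slot.pending = slot.pending, None
        if slot.in_flight is None:
            del self._slots[branch]  # idle branch: free the slot
        return slot.in_flight
```

A short usage example: intermediate commits are skipped, but the newest one is always indexed after the older task finishes.

```python
d = LatestCommitDispatcher()
d.submit("main", "c1")   # starts indexing c1
d.submit("main", "c2")   # queued
d.submit("main", "c3")   # replaces c2 as pending
d.complete("main")       # returns "c3"
```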
