Code Indexer v2: Flexible pipeline for code graph construction
## Problem to Solve

The current code graph pipeline has a per-language architecture that does not scale. Each of the 7 supported languages requires:

- Its own **parser** (~2,000–5,000 lines of imperative tree-sitter walking)
- Its own **type enums** (~20 variants per language, wrapped 3 times through the pipeline)
- Its own **resolver** (~300–2,200 lines of bespoke reference resolution)
- Its own **linker integration** (~15 files touched to add a new language)

The result is ~30,000 lines of parser code, ~10,000 lines of resolver code, and 145 definition/import type variants across 14 enums, with significant structural duplication between languages.

The three-layer type wrapping (parser enums → linker wrapper enums → processor dispatch enums) exists only to carry information that collapses to a `&str` at the Arrow serialization boundary.

```
PythonDefinitionType (11 variants) ──┐
JavaDefinitionType (14 variants)     │  Parser layer: 95 variants across 7 enums
RubyDefinitionType (6 variants)      │
KotlinDefinitionType (16 variants)   │
...                                ──┘
    ↓ wrapped into
DefinitionType::Python(PythonDefinitionType::Class)  ← Linker layer
    ↓ dispatched via
Definitions::Python(Vec<...>)                        ← Processor layer
    ↓ serialized as
"class"                                              ← Arrow output
```

## Proposed Solution

Replace the per-language architecture with a generic, language-agnostic pipeline. Four components:

### 1. Pipeline framework with canonical types

One set of canonical types flows through the entire pipeline. `DefKind` (10 variants) + `&'static str` replaces 95 enum variants. A `LanguagePipeline` trait allows both generic (DSL parser + SSA resolver) and custom (full control) strategies behind one interface. One line in a macro registers a language.

### 2. Declarative DSL engine

A declarative DSL where each language is ~80–130 lines of rule tables instead of ~2,000–5,000 lines of imperative walkers.
The DSL engine handles the tree-sitter walk; language specs just declare which node kinds are scopes, references, imports, and bindings.

### 3. SSA-based generic resolver

One generic resolver for all languages, based on the Braun et al. SSA construction algorithm. Per-language differences (import strategies, receiver conventions, chain modes) are declarative rule tables.

### 4. YAML/Cypher test framework

End-to-end correctness validation. Self-contained YAML test suites with inline source fixtures, queried with Cypher against the resulting graph. Catches regressions that unit tests miss.

---

### Background and prior art

This work builds on several earlier explorations:

- knowledge-graph!764 (*spike: declarative DSL engine for near-instant language support*). Proved that C and C++ could be added with ~30 lines of config. Python definition extraction replicated to show parity.
- knowledge-graph!766 (*feat(parser): declarative DSL engine with C/C++ language support*). Clean iteration of the DSL, with C++ composing from C by inheriting rules.
- knowledge-graph!767 (*feat(linker): global backtracking for language-agnostic reference resolution*). First pass at generic resolution: global name matching with local-first preference and ambiguity tracking.
- knowledge-graph!885 (*spike: canonical IR types and unified Language config*). Introduced `code-graph-types` crate, `DefKind` enum, `ToCanonical` trait, and the `register_languages!` macro.

These were themselves inspired by:

- [gitlab-code-parser#38](https://gitlab.com/gitlab-org/rust/gitlab-code-parser/-/work_items/38): original proposal for declarative language support
- [knowledge-graph#3](https://gitlab.com/gitlab-org/rust/knowledge-graph/-/work_items/3): codescope prototype with global backtracking resolver

The SSA-based resolver is an application of:

- Braun, M., Buchwald, S., Hack, S., Leißa, R., Mallon, C., Zwinkau, A. (2013).
[Simple and Efficient Construction of Static Single Assignment Form](https://dl.acm.org/doi/10.1007/978-3-642-37051-9_6). *Compiler Construction (CC 2013)*, LNCS vol. 7791. The paper describes on-the-fly SSA construction without a pre-computed CFG or dominance frontiers, using three operations (`write_variable`, `read_variable`, `seal_block`) with lazy phi insertion and trivial phi elimination.
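
The three operations named in the reference above can be sketched in a few dozen lines of Rust. This is a minimal illustration under stated assumptions, not the resolver's actual code: the `Ssa` struct, the `Value` enum, the helpers `add_block`, `add_pred`, `new_phi`, `resolve`, `reaching`, and `demo`, and the example names (`pkg.foo`, `local.foo`, `pkg.bar`) are all hypothetical. Only `write_variable`, `read_variable`, and `seal_block` come from the paper. The sketch also simplifies trivial phi elimination: it records replacements in a side table instead of rewriting users and re-checking dependent phis.

```rust
use std::collections::{BTreeSet, HashMap};

type Block = usize;

/// An SSA value: a concrete definition, a phi (index into `Ssa::phis`), or undefined.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Value { Def(&'static str), Phi(usize), Undef }

/// Hypothetical container; field and helper names are illustrative only.
#[derive(Default)]
struct Ssa {
    preds: Vec<Vec<Block>>,                      // predecessors of each block
    sealed: Vec<bool>,                           // true once all predecessors are known
    defs: HashMap<(&'static str, Block), Value>, // current definition per (name, block)
    incomplete: Vec<Vec<(&'static str, usize)>>, // phis waiting for seal_block
    phis: Vec<(Block, Vec<Value>)>,              // (owning block, operands)
    replaced: HashMap<usize, Value>,             // trivial phi -> replacement value
}

impl Ssa {
    fn add_block(&mut self, preds: &[Block]) -> Block {
        self.preds.push(preds.to_vec());
        self.sealed.push(false);
        self.incomplete.push(Vec::new());
        self.preds.len() - 1
    }
    fn add_pred(&mut self, block: Block, pred: Block) { self.preds[block].push(pred); }
    fn new_phi(&mut self, block: Block) -> usize {
        self.phis.push((block, Vec::new()));
        self.phis.len() - 1
    }
    fn write_variable(&mut self, var: &'static str, block: Block, value: Value) {
        self.defs.insert((var, block), value);
    }
    fn read_variable(&mut self, var: &'static str, block: Block) -> Value {
        if let Some(&v) = self.defs.get(&(var, block)) {
            return self.resolve(v); // defined (or already looked up) in this block
        }
        let value = if !self.sealed[block] {
            // Predecessors may still be added: park an operand-less phi for later.
            let phi = self.new_phi(block);
            self.incomplete[block].push((var, phi));
            Value::Phi(phi)
        } else if self.preds[block].len() == 1 {
            let pred = self.preds[block][0];
            self.read_variable(var, pred) // single predecessor: no phi needed
        } else {
            let phi = self.new_phi(block);
            self.write_variable(var, block, Value::Phi(phi)); // breaks cycles in loops
            self.add_phi_operands(var, phi)
        };
        self.write_variable(var, block, value);
        value
    }
    fn add_phi_operands(&mut self, var: &'static str, phi: usize) -> Value {
        for pred in self.preds[self.phis[phi].0].clone() {
            let operand = self.read_variable(var, pred);
            self.phis[phi].1.push(operand);
        }
        self.try_remove_trivial_phi(phi)
    }
    /// A phi whose operands are all one value (ignoring itself) is replaced by that value.
    fn try_remove_trivial_phi(&mut self, phi: usize) -> Value {
        let mut same = None;
        for &op in &self.phis[phi].1 {
            let op = self.resolve(op);
            if Some(op) == same || op == Value::Phi(phi) { continue; }
            if same.is_some() { return Value::Phi(phi); } // merges two distinct values
            same = Some(op);
        }
        let same = same.unwrap_or(Value::Undef);
        self.replaced.insert(phi, same);
        same
    }
    fn seal_block(&mut self, block: Block) {
        for (var, phi) in std::mem::take(&mut self.incomplete[block]) {
            self.add_phi_operands(var, phi);
        }
        self.sealed[block] = true;
    }
    /// Follow the replacement table until a live value is reached.
    fn resolve(&self, mut v: Value) -> Value {
        while let Value::Phi(p) = v {
            match self.replaced.get(&p) { Some(&r) => v = r, None => break }
        }
        v
    }
    /// Flatten a value into the set of concrete definitions that may reach it.
    fn reaching(&self, v: Value) -> BTreeSet<&'static str> {
        let (mut out, mut seen, mut stack) = (BTreeSet::new(), BTreeSet::new(), vec![v]);
        while let Some(v) = stack.pop() {
            match self.resolve(v) {
                Value::Def(d) => { out.insert(d); }
                Value::Phi(p) if seen.insert(p) => stack.extend(self.phis[p].1.iter().copied()),
                _ => {}
            }
        }
        out
    }
}

/// if/else diamond: `foo` is rebound in one branch only, `bar` is never rebound.
fn demo() -> (Vec<&'static str>, Value) {
    let mut ssa = Ssa::default();
    let entry = ssa.add_block(&[]);
    ssa.seal_block(entry);
    ssa.write_variable("foo", entry, Value::Def("pkg.foo"));
    ssa.write_variable("bar", entry, Value::Def("pkg.bar"));
    let then_b = ssa.add_block(&[entry]);
    ssa.seal_block(then_b);
    ssa.write_variable("foo", then_b, Value::Def("local.foo"));
    let else_b = ssa.add_block(&[entry]);
    ssa.seal_block(else_b);
    let merge = ssa.add_block(&[then_b, else_b]);
    ssa.seal_block(merge);
    let (foo, bar) = (ssa.read_variable("foo", merge), ssa.read_variable("bar", merge));
    (ssa.reaching(foo).into_iter().collect(), bar)
}
```

In `demo()`, the reference to `foo` at the merge block keeps a phi with two candidate definitions, while the phi for `bar` is trivial and collapses to the single definition from the entry block. For a loop header, the same structure would read variables before the back edge is known, add the edge with `add_pred`, and only then call `seal_block`.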