feat(code-graph): v2 pipeline performance + zero-fuzz resolution
What does this MR do and why?
Part of #455 (closed), #456 (closed), #457 (closed), #458 (closed). Continues the v2 code-graph pipeline from !887 (merged) with three focus areas: performance optimization, architectural simplification, and correctness fixes. The v2 pipeline went from a multi-representation system (5 intermediate data structures, 3 rebuilds) to a single graph-central architecture where petgraph's DiGraph is the only representation from parse to serialization.
Graph-central architecture
The previous architecture rebuilt the same data multiple times:
Parser → CanonicalResult (flat vecs)
→ ResolutionContext (rebuilds indexes)
→ Resolver → ResolvedEdge (flat vec)
→ GraphBuilder (rebuilds into petgraph)
→ CodeGraph → Arrow batches
Now the graph exists from the start. Each file's parse locks a shared Mutex<CodeGraph>, adds its nodes, gets NodeIndex values back, and walks with them. The resolver reads the graph directly. Edges go straight into the graph. No intermediate types.
Mutex<CodeGraph> created
│
├─ Parallel: parse → lock(graph) → add_file_nodes → unlock → walk with NodeIndex
│ SSA writes Value::Def(NodeIndex), Value::Import(NodeIndex)
│
├─ Sequential: graph.finalize() (dirs, containment, ancestor chains)
│
├─ Parallel: resolve(&graph, &results, walks) → (NodeIndex, NodeIndex, GraphEdge)
│
└─ Sequential: graph.add_edges()
Deleted types: DefRef, EdgeSource, ResolvedEdge, ResolutionContext, DefinitionIndex, MemberIndex, GraphBuilder, linker-side BranchRule/LoopRule/BindingRule/BindingKind
Deleted fields: def_index (replaced by defs_by_file), branches/loops/bindings on ResolutionRules
Module consolidation: 11 linker files → 5 (graph.rs, resolve.rs, walker.rs, ssa.rs, rules.rs)
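The lock, add nodes, unlock, then walk flow above can be sketched with the standard library alone. This is a minimal illustration, not the MR's code: `CodeGraph` here is a hypothetical Vec-backed stand-in for the petgraph-based type, `NodeIndex` is a local newtype rather than petgraph's, and `parse_all` is an invented helper name.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for petgraph's NodeIndex.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIndex(pub usize);

/// Hypothetical Vec-backed stand-in for the real CodeGraph.
#[derive(Default)]
pub struct CodeGraph {
    defs: Vec<String>,
}

impl CodeGraph {
    /// Add one file's definitions and hand back their indices.
    pub fn add_file_nodes(&mut self, names: &[&str]) -> Vec<NodeIndex> {
        names
            .iter()
            .map(|name| {
                self.defs.push(name.to_string());
                NodeIndex(self.defs.len() - 1)
            })
            .collect()
    }

    /// O(1) lookup of a definition by index.
    pub fn def(&self, idx: NodeIndex) -> &str {
        &self.defs[idx.0]
    }

    pub fn node_count(&self) -> usize {
        self.defs.len()
    }
}

/// One thread per file; the lock is held only while nodes are inserted.
pub fn parse_all(files: &[Vec<&str>]) -> (CodeGraph, Vec<Vec<NodeIndex>>) {
    let graph = Mutex::new(CodeGraph::default());
    let per_file = thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|names| {
                let graph = &graph;
                s.spawn(move || {
                    // The guard is a temporary, so the lock is released
                    // at the end of this statement.
                    let indices = graph.lock().unwrap().add_file_nodes(names);
                    // A real walk over `indices` would run here, unlocked.
                    indices
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<_>>()
    });
    (graph.into_inner().unwrap(), per_file)
}
```

Insertion order across files is nondeterministic in this sketch, but each file keeps the indices it was handed, so later phases never need a name-based lookup.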
Parser extracts bindings and control flow
Binding extraction, branch detection, and loop detection moved from the walker to the parser. The walker no longer matches AST node kinds against rule tables — the parser already did that. The walker matches by byte offset against parsed CanonicalBinding and CanonicalControlFlow structs.
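As a rough illustration of matching by byte offset, the walker could look up the parsed record for the node it is visiting instead of inspecting the node kind. The sketch below uses trimmed-down versions of the parser types; `control_flow_at` is a hypothetical helper, and it assumes the records are sorted by start byte with distinct start offsets, which the MR does not state.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ControlFlowKind {
    Branch { has_catch_all: bool },
    Loop,
}

/// Trimmed-down version of the parser output, for illustration only.
#[derive(Debug, Clone)]
pub struct CanonicalControlFlow {
    pub kind: ControlFlowKind,
    pub byte_range: (usize, usize),
}

/// Find the control-flow record covering exactly `range`.
/// Assumes `flows` is sorted by start byte and start bytes are distinct.
pub fn control_flow_at(
    flows: &[CanonicalControlFlow],
    range: (usize, usize),
) -> Option<&CanonicalControlFlow> {
    flows
        .binary_search_by_key(&range.0, |cf| cf.byte_range.0)
        .ok()
        .map(|i| &flows[i])
        // Same start byte is not enough; the end byte must agree too.
        .filter(|cf| cf.byte_range.1 == range.1)
}
```

With this shape the walker needs no per-language rule table: a node either has a matching record at its byte range or it is not control flow.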
// Parser output (new fields on CanonicalResult):
pub struct CanonicalBinding {
    pub name: String,
    pub kind: BindingKind,
    pub type_annotation: Option<String>, // from AST: "Builder", "int", etc.
    pub rhs_name: Option<String>, // callee name: get_builder() → "get_builder"
    pub instance_attr: bool,
}
pub struct CanonicalControlFlow {
    pub kind: ControlFlowKind, // Branch { has_catch_all } | Loop
    pub node_kind: String,
    pub byte_range: (usize, usize),
    pub children: Vec<ControlFlowChild>,
}
Per-language binding/branch/loop rules moved from HasRules to DslLanguage, so DslParser picks them up during parsing:
impl DslLanguage for JavaDsl {
    fn bindings() -> Vec<BindingRule> {
        vec![
            binding("local_variable_declaration", BindingKind::Assignment)
                .name_from(&["declarator", "name"])
                .typed(&["type"], &["int", "long", "void", "String", ...]),
        ]
    }
    fn branches() -> Vec<BranchRule> { /* if, try, switch, ternary */ }
    fn loops() -> Vec<LoopRule> { /* for, while, enhanced_for, do */ }
}
SSA improvements
Value::Alias(IStr) — deferred binding resolution. When the walker sees b = getService(), it writes Value::Alias("getService") instead of doing an SSA read during the walk phase. The resolver follows the alias later with full cross-file context. The walker is write-only for SSA — no reads during walk.
Value::Def(NodeIndex) / Value::Import(NodeIndex) — SSA values reference graph nodes directly. No (file_idx, def_idx) double indirection. graph.def(idx) gives the definition in O(1).
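A simplified sketch of how a resolver might follow deferred aliases after the walk is shown below. It is an assumption-laden illustration: `String` stands in for `IStr`, `NodeIndex` is a local newtype, the `HashMap` environment and the `follow` helper with its hop limit are hypothetical, and the real resolver consults cross-file graph context rather than a single flat map.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for petgraph's NodeIndex.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIndex(pub usize);

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Def(NodeIndex),
    Import(NodeIndex),
    /// Deferred: names another binding to be resolved later.
    Alias(String),
}

/// Follow alias links until a graph node is reached.
/// Returns None for unknown names or when more than `max_hops`
/// aliases would have to be followed (which also stops cycles).
pub fn follow(
    env: &HashMap<String, Value>,
    name: &str,
    max_hops: usize,
) -> Option<NodeIndex> {
    let mut current = env.get(name)?;
    let mut hops = 0;
    loop {
        match current {
            Value::Def(idx) | Value::Import(idx) => return Some(*idx),
            Value::Alias(target) => {
                if hops == max_hops {
                    return None;
                }
                hops += 1;
                current = env.get(target.as_str())?;
            }
        }
    }
}
```

Because the walker only writes `Value::Alias` and never reads, the order in which files are walked cannot change what an alias eventually resolves to.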
Configurable resolution stages
The hardcoded 3-tier fallback in resolve_bare (SSA → imports → implicit this) is replaced by a declarative stage list:
// Java: SSA first, then imports, then implicit member lookup
bare_stages: vec![ResolveStage::SSA, ResolveStage::ImportStrategies, ResolveStage::ImplicitMember]
// Python: SSA and imports only (explicit self, no implicit this)
bare_stages: vec![ResolveStage::SSA, ResolveStage::ImportStrategies]
Hidden heuristics made configurable via ResolveSettings:
- chain_fallback: bool (fall back to bare resolution when chain base unresolvable)
- compound_key_recovery: bool (mid-chain SSA recovery via compound keys)
- implicit_this_on_base: bool (implicit member lookup on chain bases)
- per_file_timeout: Option<Duration> (cap resolution time per file)
- max_chain_depth: usize (truncate long fluent chains)
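One way to read the declarative stage list is as a first-match loop: stages are tried in the configured order and the first one that yields a node wins. The sketch below assumes that reading. `resolve_bare_sketch` and the `lookup` callback are hypothetical names, and `usize` stands in for a graph node index.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResolveStage {
    SSA,
    ImportStrategies,
    ImplicitMember,
}

/// Try each configured stage in order and return the first hit,
/// together with the stage that produced it.
/// `lookup` is a hypothetical callback that runs a single stage.
pub fn resolve_bare_sketch<F>(
    stages: &[ResolveStage],
    mut lookup: F,
) -> Option<(ResolveStage, usize)>
where
    F: FnMut(ResolveStage) -> Option<usize>,
{
    stages
        .iter()
        .find_map(|&stage| lookup(stage).map(|node| (stage, node)))
}
```

Under this reading, a name that only the implicit-member stage can find resolves with the Java list and stays unresolved with the Python list, since that stage is simply absent from the Python configuration.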
Performance
Benchmarks (elasticsearch, 22,935 Java files):
| Metric | Value |
|---|---|
| Parse + walk | ~3s |
| Resolve | ~4s (2M edges) |
| Peak RSS | 4.3 GB (was 5.7 GB) |
Correctness
- Zero-fuzz resolution: removed GlobalName strategy, lookup_fqn_joined, and bare-name fallback
- Canonical FQN for enclosing_type_fqn and self/super SSA writes (fixes ~193K false negative edges)
- Java wildcard imports: .wildcard_child("asterisk") on import rule
- is_top_level fix for single-segment module scopes (Python test.py)
- extract_rhs_name uses matches(), not just kind() (fixes Python cross-file chain resolution)
- All 38 YAML Cypher suites passing (was 35/36)
- python_type_flow cross-file chain test now passes
Known gaps
- Python FilePath import strategy still stubbed (relative imports)
- self.db.query() instance attribute SSA across sibling methods not yet working
- C# resolution rules not implemented (parse-only)
- TypeScript/Ruby/Rust/Go v2 support not started