feat(code-graph): v2 pipeline performance + zero-fuzz resolution

What does this MR do and why?

Part of #455 (closed), #456 (closed), #457 (closed), #458 (closed). Continues the v2 code-graph pipeline from !887 (merged) with three focus areas: performance optimization, architectural simplification, and correctness fixes. The v2 pipeline went from a multi-representation system (5 intermediate data structures, 3 rebuilds) to a single graph-central architecture where petgraph's DiGraph is the only representation from parse to serialization.


Graph-central architecture

The previous architecture rebuilt the same data multiple times:

Parser → CanonicalResult (flat vecs)
  → ResolutionContext (rebuilds indexes)
    → Resolver → ResolvedEdge (flat vec)
      → GraphBuilder (rebuilds into petgraph)
        → CodeGraph → Arrow batches

Now the graph exists from the start. Each file's parse locks a shared Mutex<CodeGraph>, adds its nodes, gets NodeIndex values back, and walks with them. The resolver reads the graph directly. Edges go straight into the graph. No intermediate types.

Mutex<CodeGraph> created

  ├─ Parallel: parse → lock(graph) → add_file_nodes → unlock → walk with NodeIndex
  │    SSA writes Value::Def(NodeIndex), Value::Import(NodeIndex)

  ├─ Sequential: graph.finalize() (dirs, containment, ancestor chains)

  ├─ Parallel: resolve(&graph, &results, walks) → (NodeIndex, NodeIndex, GraphEdge)

  └─ Sequential: graph.add_edges()
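
The lock-per-file flow above can be sketched with only the standard library. This is a minimal illustration, not the MR's implementation: `NodeIndex` here is a plain wrapper standing in for petgraph's type, `CodeGraph` is reduced to two vectors, and `parse_all` is a hypothetical name. The point it shows is that each worker holds the lock only while inserting its nodes and then continues with the returned indices.

```rust
use std::sync::Mutex;
use std::thread;

/// Stand-in for petgraph's NodeIndex (assumption: simplified for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIndex(pub usize);

/// Hypothetical, heavily reduced CodeGraph: node names plus index pairs.
#[derive(Default)]
pub struct CodeGraph {
    pub nodes: Vec<String>,
    pub edges: Vec<(NodeIndex, NodeIndex)>,
}

impl CodeGraph {
    /// Adds all definitions of one file and returns their indices.
    pub fn add_file_nodes(&mut self, defs: &[&str]) -> Vec<NodeIndex> {
        defs.iter()
            .map(|d| {
                self.nodes.push((*d).to_string());
                NodeIndex(self.nodes.len() - 1)
            })
            .collect()
    }

    /// Appends resolved edges in one sequential pass.
    pub fn add_edges(&mut self, edges: Vec<(NodeIndex, NodeIndex)>) {
        self.edges.extend(edges);
    }
}

/// Parses files in parallel; each worker locks the graph only to insert nodes.
pub fn parse_all(files: &[Vec<&str>]) -> (CodeGraph, Vec<Vec<NodeIndex>>) {
    let graph = Mutex::new(CodeGraph::default());
    let graph_ref = &graph;
    let per_file: Vec<Vec<NodeIndex>> = thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|defs| {
                s.spawn(move || {
                    // The guard is a temporary, so the lock is released
                    // at the end of this statement.
                    let ids = graph_ref.lock().unwrap().add_file_nodes(defs);
                    // The walk would run here, outside the lock, using `ids`.
                    ids
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    (graph.into_inner().unwrap(), per_file)
}
```

Node insertion order across files depends on thread scheduling, so callers in this sketch address nodes only through the indices returned for their own file.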

Deleted types: DefRef, EdgeSource, ResolvedEdge, ResolutionContext, DefinitionIndex, MemberIndex, GraphBuilder, linker-side BranchRule/LoopRule/BindingRule/BindingKind

Deleted fields: def_index (replaced by defs_by_file), branches/loops/bindings on ResolutionRules

Module consolidation: 11 linker files → 5 (graph.rs, resolve.rs, walker.rs, ssa.rs, rules.rs)


Parser extracts bindings and control flow

Binding extraction, branch detection, and loop detection moved from the walker to the parser. The walker no longer matches AST node kinds against rule tables — the parser already did that. The walker matches by byte offset against parsed CanonicalBinding and CanonicalControlFlow structs.

// Parser output (new fields on CanonicalResult):
pub struct CanonicalBinding {
    pub name: String,
    pub kind: BindingKind,
    pub type_annotation: Option<String>,  // from AST: "Builder", "int", etc.
    pub rhs_name: Option<String>,         // callee name: get_builder() → "get_builder"
    pub instance_attr: bool,
}

pub struct CanonicalControlFlow {
    pub kind: ControlFlowKind,            // Branch { has_catch_all } | Loop
    pub node_kind: String,
    pub byte_range: (usize, usize),
    pub children: Vec<ControlFlowChild>,
}
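
To make the byte-offset matching concrete, here is a rough sketch of how a walker might find the control-flow region enclosing a given position. The structs are trimmed copies of the ones above, `enclosing_control_flow` is a hypothetical helper, and treating `byte_range` as half-open `[start, end)` is an assumption; the MR does not state the convention.

```rust
/// Trimmed stand-ins for the parser output structs shown above.
#[derive(Debug, Clone, PartialEq)]
pub enum ControlFlowKind {
    Branch { has_catch_all: bool },
    Loop,
}

#[derive(Debug, Clone)]
pub struct CanonicalControlFlow {
    pub kind: ControlFlowKind,
    pub byte_range: (usize, usize),
}

/// Returns the innermost region containing `offset`, if any.
/// Assumes half-open ranges: start <= offset < end.
pub fn enclosing_control_flow(
    regions: &[CanonicalControlFlow],
    offset: usize,
) -> Option<&CanonicalControlFlow> {
    regions
        .iter()
        .filter(|r| r.byte_range.0 <= offset && offset < r.byte_range.1)
        // The smallest containing range is the innermost one.
        .min_by_key(|r| r.byte_range.1 - r.byte_range.0)
}
```

A linear scan keeps the sketch short; a real walker visiting nodes in source order could instead keep a stack of open regions.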

Per-language binding/branch/loop rules moved from HasRules to DslLanguage, so DslParser picks them up during parsing:

impl DslLanguage for JavaDsl {
    fn bindings() -> Vec<BindingRule> {
        vec![
            binding("local_variable_declaration", BindingKind::Assignment)
                .name_from(&["declarator", "name"])
                .typed(&["type"], &["int", "long", "void", "String", ...]),
        ]
    }
    fn branches() -> Vec<BranchRule> { /* if, try, switch, ternary */ }
    fn loops() -> Vec<LoopRule> { /* for, while, enhanced_for, do */ }
}

SSA improvements

Value::Alias(IStr) — deferred binding resolution. When the walker sees b = getService(), it writes Value::Alias("getService") instead of doing an SSA read during the walk phase. The resolver follows the alias later with full cross-file context. The walker is write-only for SSA — no reads during walk.

Value::Def(NodeIndex) / Value::Import(NodeIndex) — SSA values reference graph nodes directly. No (file_idx, def_idx) double indirection. graph.def(idx) gives the definition in O(1).
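
The deferred-alias idea can be illustrated roughly as follows. This is a sketch under assumptions: `String` stands in for the interned `IStr`, the SSA store is flattened to a single `HashMap`, and `follow_alias` with its `max_hops` guard is a hypothetical helper, not a function from this MR.

```rust
use std::collections::HashMap;

/// Stand-in for petgraph's NodeIndex.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIndex(pub usize);

/// SSA values as described above; `String` replaces `IStr` in this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Def(NodeIndex),
    Import(NodeIndex),
    Alias(String),
}

/// Hypothetical resolver step: follow aliases until a graph node is reached.
/// `max_hops` guards against alias cycles such as `a = b; b = a`.
pub fn follow_alias(
    ssa: &HashMap<String, Value>,
    name: &str,
    max_hops: usize,
) -> Option<NodeIndex> {
    let mut current = name;
    for _ in 0..=max_hops {
        match ssa.get(current)? {
            Value::Def(idx) | Value::Import(idx) => return Some(*idx),
            Value::Alias(target) => current = target.as_str(),
        }
    }
    None
}
```

In this model the walker only ever inserts into `ssa`; all lookups happen later in `follow_alias`, which mirrors the write-only walk described above.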


Configurable resolution stages

The hardcoded 3-tier fallback in resolve_bare (SSA → imports → implicit this) is replaced by a declarative stage list:

// Java: SSA first, then imports, then implicit member lookup
bare_stages: vec![ResolveStage::SSA, ResolveStage::ImportStrategies, ResolveStage::ImplicitMember]

// Python: SSA and imports only (explicit self, no implicit this)
bare_stages: vec![ResolveStage::SSA, ResolveStage::ImportStrategies]
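
One way to picture how such a stage list might drive lookup is the sketch below. Only the `ResolveStage` variant names come from this MR; `FileScope`, its three tables, the `usize` node ids, and this signature for `resolve_bare` are hypothetical simplifications.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResolveStage {
    SSA,
    ImportStrategies,
    ImplicitMember,
}

/// Hypothetical per-file lookup tables, one per stage.
#[derive(Default)]
pub struct FileScope {
    pub ssa: HashMap<String, usize>,
    pub imports: HashMap<String, usize>,
    pub members: HashMap<String, usize>,
}

/// Tries each configured stage in order; the first hit wins.
pub fn resolve_bare(
    stages: &[ResolveStage],
    scope: &FileScope,
    name: &str,
) -> Option<(ResolveStage, usize)> {
    stages.iter().find_map(|stage| {
        let table = match stage {
            ResolveStage::SSA => &scope.ssa,
            ResolveStage::ImportStrategies => &scope.imports,
            ResolveStage::ImplicitMember => &scope.members,
        };
        table.get(name).map(|id| (*stage, *id))
    })
}
```

With the Python list, a name that exists only as a class member stays unresolved, which is the intended effect of leaving `ImplicitMember` out.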

Hidden heuristics made configurable via ResolveSettings:

  • chain_fallback: bool — fall back to bare resolution when chain base unresolvable
  • compound_key_recovery: bool — mid-chain SSA recovery via compound keys
  • implicit_this_on_base: bool — implicit member lookup on chain bases
  • per_file_timeout: Option<Duration> — cap resolution time per file
  • max_chain_depth: usize — truncate long fluent chains
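
As a rough illustration of how two of these settings could be applied, consider the sketch below. The field names and types follow the list above; `truncate_chain` and `timed_out` are hypothetical helpers, and keeping the leading segments of a chain when truncating is an assumption.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Debug)]
pub struct ResolveSettings {
    pub chain_fallback: bool,
    pub compound_key_recovery: bool,
    pub implicit_this_on_base: bool,
    pub per_file_timeout: Option<Duration>,
    pub max_chain_depth: usize,
}

/// Keeps at most `max_chain_depth` leading segments of a fluent chain.
/// (Assumption: truncation drops the tail, not the head.)
pub fn truncate_chain<'a>(
    settings: &ResolveSettings,
    chain: &'a [&'a str],
) -> &'a [&'a str] {
    &chain[..chain.len().min(settings.max_chain_depth)]
}

/// Returns true when the per-file budget, if one is set, has been used up.
pub fn timed_out(settings: &ResolveSettings, started: Instant) -> bool {
    settings
        .per_file_timeout
        .map_or(false, |limit| started.elapsed() >= limit)
}
```

A resolver loop would typically call `timed_out` between references and stop emitting edges for the current file once it returns true.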

Performance

Benchmarks (elasticsearch, 22,935 Java files):

Metric        Value
Parse + walk  ~3s
Resolve       ~4s (2M edges)
Peak RSS      4.3 GB (was 5.7 GB)

Correctness

  • Zero-fuzz resolution: removed GlobalName strategy, lookup_fqn_joined, and bare-name fallback
  • Canonical FQN for enclosing_type_fqn and self/super SSA writes — fixes ~193K false negative edges
  • Java wildcard imports: .wildcard_child("asterisk") on import rule
  • is_top_level fix for single-segment module scopes (Python test.py)
  • extract_rhs_name uses matches() not just kind() — fixes Python cross-file chain resolution
  • All 38 YAML Cypher suites passing (was 35/36)
  • python_type_flow cross-file chain test now passes

Known gaps

  • Python FilePath import strategy still stubbed (relative imports)
  • self.db.query() instance attribute SSA across sibling methods not yet working
  • C# resolution rules not implemented (parse-only)
  • TypeScript/Ruby/Rust/Go v2 support not started
Edited by Michael Usachenko