feat(code-graph): v2 pipeline performance + zero-fuzz resolution (!902) · Merge requests · GitLab.org / orbit / GitLab Knowledge Graph

What does this MR do and why?

Part of #455 (closed), #456 (closed), #457 (closed), #458 (closed). Continues the v2 code-graph pipeline from !887 (merged) with three focus areas: performance optimization, architectural simplification, and correctness fixes. The v2 pipeline went from a multi-representation system (5 intermediate data structures, 3 rebuilds) to a single graph-central architecture where petgraph's DiGraph is the only representation from parse to serialization.

Graph-central architecture

The previous architecture rebuilt the same data multiple times:

Parser → CanonicalResult (flat vecs)
  → ResolutionContext (rebuilds indexes)
    → Resolver → ResolvedEdge (flat vec)
      → GraphBuilder (rebuilds into petgraph)
        → CodeGraph → Arrow batches

Now the graph exists from the start. Each file's parse step locks a shared Mutex<CodeGraph>, adds its nodes, gets NodeIndex values back, releases the lock, and walks the AST with those indices. The resolver reads the graph directly, and resolved edges go straight into it. No intermediate types.

Mutex<CodeGraph> created
  │
  ├─ Parallel: parse → lock(graph) → add_file_nodes → unlock → walk with NodeIndex
  │    SSA writes Value::Def(NodeIndex), Value::Import(NodeIndex)
  │
  ├─ Sequential: graph.finalize() (dirs, containment, ancestor chains)
  │
  ├─ Parallel: resolve(&graph, &results, walks) → (NodeIndex, NodeIndex, GraphEdge)
  │
  └─ Sequential: graph.add_edges()
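
The parallel parse phase above can be sketched roughly as follows. This is a std-only illustration, not the MR's code: `NodeIndex` and `CodeGraph` here are small stand-ins for petgraph's `NodeIndex` and the real `CodeGraph`, and `build_graph` is a hypothetical name. It shows the lock being held only while a file's nodes are added, with the returned indices used afterwards without the lock.

```rust
use std::sync::Mutex;
use std::thread;

/// Stand-in for petgraph's `NodeIndex`: a stable handle into the node list.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIndex(pub usize);

/// Minimal stand-in for `CodeGraph`: nodes are definition names,
/// edges are (source, target) pairs.
#[derive(Default)]
pub struct CodeGraph {
    pub nodes: Vec<String>,
    pub edges: Vec<(NodeIndex, NodeIndex)>,
}

impl CodeGraph {
    /// Adds all definitions of one file and returns their indices.
    pub fn add_file_nodes(&mut self, defs: &[&str]) -> Vec<NodeIndex> {
        defs.iter()
            .map(|d| {
                self.nodes.push((*d).to_string());
                NodeIndex(self.nodes.len() - 1)
            })
            .collect()
    }
}

/// Hypothetical driver: one thread per file, lock held only while adding nodes.
pub fn build_graph(files: &[Vec<&str>]) -> (CodeGraph, Vec<Vec<NodeIndex>>) {
    let graph = Mutex::new(CodeGraph::default());
    let per_file: Vec<Vec<NodeIndex>> = thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|defs| {
                let graph = &graph;
                s.spawn(move || {
                    // lock -> add_file_nodes -> unlock (guard dropped at end of statement)
                    let ids = graph.lock().unwrap().add_file_nodes(defs);
                    // The walk would continue here, lock-free, using `ids`.
                    ids
                })
            })
            .collect();
        // Join in file order so the result lines up with `files`.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    (graph.into_inner().unwrap(), per_file)
}
```

Node indices depend on which thread takes the lock first, so callers should rely only on the indices handed back for each file, never on a global ordering.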

Deleted types: DefRef, EdgeSource, ResolvedEdge, ResolutionContext, DefinitionIndex, MemberIndex, GraphBuilder, linker-side BranchRule/LoopRule/BindingRule/BindingKind

Deleted fields: def_index (replaced by defs_by_file), branches/loops/bindings on ResolutionRules

Module consolidation: 11 linker files → 5 (graph.rs, resolve.rs, walker.rs, ssa.rs, rules.rs)

Parser extracts bindings and control flow

Binding extraction, branch detection, and loop detection moved from the walker to the parser. The walker no longer matches AST node kinds against rule tables — the parser already did that. The walker matches by byte offset against parsed CanonicalBinding and CanonicalControlFlow structs.

// Parser output (new fields on CanonicalResult):
pub struct CanonicalBinding {
    pub name: String,
    pub kind: BindingKind,
    pub type_annotation: Option<String>,  // from AST: "Builder", "int", etc.
    pub rhs_name: Option<String>,         // callee name: get_builder() → "get_builder"
    pub instance_attr: bool,
}

pub struct CanonicalControlFlow {
    pub kind: ControlFlowKind,            // Branch { has_catch_all } | Loop
    pub node_kind: String,
    pub byte_range: (usize, usize),
    pub children: Vec<ControlFlowChild>,
}
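
One way the walker could match by byte offset is an exact-range lookup, sketched below. The MR does not show the walker's lookup, so `ControlFlowIndex` and its methods are hypothetical, and the struct is trimmed to the two fields the lookup needs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum ControlFlowKind {
    Branch { has_catch_all: bool },
    Loop,
}

/// Trimmed-down version of `CanonicalControlFlow` (other fields omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct ControlFlow {
    pub kind: ControlFlowKind,
    pub byte_range: (usize, usize),
}

/// Hypothetical index: maps an AST node's byte range to the parser's record.
pub struct ControlFlowIndex<'a> {
    by_range: HashMap<(usize, usize), &'a ControlFlow>,
}

impl<'a> ControlFlowIndex<'a> {
    /// Builds the index once per file from the parser output.
    pub fn new(items: &'a [ControlFlow]) -> Self {
        Self {
            by_range: items.iter().map(|cf| (cf.byte_range, cf)).collect(),
        }
    }

    /// Looks up the record for a node spanning exactly `start..end`.
    pub fn at(&self, start: usize, end: usize) -> Option<&'a ControlFlow> {
        self.by_range.get(&(start, end)).copied()
    }
}
```

With an index like this the walker never inspects node kinds: it asks whether the parser recorded anything at the current node's range and acts on the stored `kind`.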

Per-language binding/branch/loop rules moved from HasRules to DslLanguage, so DslParser picks them up during parsing:

impl DslLanguage for JavaDsl {
    fn bindings() -> Vec<BindingRule> {
        vec![
            binding("local_variable_declaration", BindingKind::Assignment)
                .name_from(&["declarator", "name"])
                .typed(&["type"], &["int", "long", "void", "String", ...]),
        ]
    }
    fn branches() -> Vec<BranchRule> { /* if, try, switch, ternary */ }
    fn loops() -> Vec<LoopRule> { /* for, while, enhanced_for, do */ }
}

SSA improvements

Value::Alias(IStr) — deferred binding resolution. When the walker sees b = getService(), it writes Value::Alias("getService") instead of doing an SSA read during the walk phase. The resolver follows the alias later with full cross-file context. The walker is write-only for SSA — no reads during walk.

Value::Def(NodeIndex) / Value::Import(NodeIndex) — SSA values reference graph nodes directly. No (file_idx, def_idx) double indirection. graph.def(idx) gives the definition in O(1).
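
A minimal sketch of how a resolver might follow a deferred alias, under stated assumptions: `String` replaces the interned `IStr`, `NodeIndex` is a local stand-in for petgraph's type, and `follow_alias` with its `max_hops` cycle guard is hypothetical rather than the MR's implementation.

```rust
use std::collections::HashMap;

/// Stand-in for petgraph's `NodeIndex`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIndex(pub usize);

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Def(NodeIndex),
    Import(NodeIndex),
    Alias(String), // the MR uses an interned `IStr`; `String` keeps this std-only
}

/// Follows `Alias` links until a graph node is reached.
/// `max_hops` guards against alias cycles such as `a = b; b = a`.
pub fn follow_alias(
    env: &HashMap<String, Value>,
    name: &str,
    max_hops: usize,
) -> Option<NodeIndex> {
    let mut current = name;
    for _ in 0..=max_hops {
        match env.get(current)? {
            Value::Def(idx) | Value::Import(idx) => return Some(*idx),
            Value::Alias(next) => current = next.as_str(),
        }
    }
    None
}
```

Because the walker only writes `Alias` entries, all lookups happen in the resolve phase, when every file's definitions are already in the graph.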

Configurable resolution stages

The hardcoded 3-tier fallback in resolve_bare (SSA → imports → implicit this) is replaced by a declarative stage list:

// Java: SSA first, then imports, then implicit member lookup
bare_stages: vec![ResolveStage::SSA, ResolveStage::ImportStrategies, ResolveStage::ImplicitMember]

// Python: SSA and imports only (explicit self, no implicit this)
bare_stages: vec![ResolveStage::SSA, ResolveStage::ImportStrategies]
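
A stage list like this can drive resolution with a simple first-match loop. The sketch below assumes each stage is a plain name lookup. The `Tables` struct and this `resolve_bare` signature are hypothetical simplifications, and node ids are plain `usize` values.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResolveStage {
    SSA,
    ImportStrategies,
    ImplicitMember,
}

/// Hypothetical lookup tables, one per stage; values are node ids.
#[derive(Default)]
pub struct Tables {
    pub ssa: HashMap<String, usize>,
    pub imports: HashMap<String, usize>,
    pub members: HashMap<String, usize>,
}

/// Tries each configured stage in order and returns the first hit.
pub fn resolve_bare(name: &str, stages: &[ResolveStage], t: &Tables) -> Option<usize> {
    stages.iter().find_map(|stage| match stage {
        ResolveStage::SSA => t.ssa.get(name).copied(),
        ResolveStage::ImportStrategies => t.imports.get(name).copied(),
        ResolveStage::ImplicitMember => t.members.get(name).copied(),
    })
}
```

Under this sketch, a name that exists only as a class member resolves with the Java stage list and stays unresolved with the Python one, since `ImplicitMember` is never consulted.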

Previously hidden heuristics are now configurable via ResolveSettings:

chain_fallback: bool — fall back to bare resolution when chain base unresolvable
compound_key_recovery: bool — mid-chain SSA recovery via compound keys
implicit_this_on_base: bool — implicit member lookup on chain bases
per_file_timeout: Option<Duration> — cap resolution time per file
max_chain_depth: usize — truncate long fluent chains
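
As a sketch, the settings above could be grouped like this. The field names and types come from the list. The two helpers are hypothetical, and the choice to keep the leading segments of an over-long chain is an assumption, since the MR does not say which end is truncated.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
pub struct ResolveSettings {
    pub chain_fallback: bool,
    pub compound_key_recovery: bool,
    pub implicit_this_on_base: bool,
    pub per_file_timeout: Option<Duration>,
    pub max_chain_depth: usize,
}

/// Hypothetical helper: keeps at most `max_chain_depth` leading segments.
pub fn truncate_chain<'a>(chain: &'a [&'a str], settings: &ResolveSettings) -> &'a [&'a str] {
    &chain[..chain.len().min(settings.max_chain_depth)]
}

/// Hypothetical helper: true once a file has used up its resolution budget.
/// With no timeout configured, resolution is never cut short.
pub fn timed_out(started: Instant, settings: &ResolveSettings) -> bool {
    settings
        .per_file_timeout
        .map_or(false, |limit| started.elapsed() >= limit)
}
```

A resolver loop would call `timed_out` between references and stop emitting edges for the file once it returns true.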

Performance

Benchmarks (elasticsearch, 22,935 Java files):

Metric	Value
Parse + walk	~3s
Resolve	~4s (2M edges)
Peak RSS	4.3 GB (was 5.7 GB)

Correctness

Zero-fuzz resolution: removed GlobalName strategy, lookup_fqn_joined, and bare-name fallback
Canonical FQN for enclosing_type_fqn and self/super SSA writes — fixes ~193K false negative edges
Java wildcard imports: .wildcard_child("asterisk") on import rule
is_top_level fix for single-segment module scopes (Python test.py)
extract_rhs_name uses matches() not just kind() — fixes Python cross-file chain resolution
All 38 YAML Cypher suites passing (was 35/36)
python_type_flow cross-file chain test now passes
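
To illustrate what exact-match resolution means for wildcard imports, here is a hypothetical sketch: a name reached through `import pkg.*` resolves only if `pkg.Name` exists as a known FQN. The function name, the set-based lookup, and the choice to return nothing on ambiguity are all assumptions made for illustration, not the MR's implementation.

```rust
use std::collections::HashSet;

/// Resolves `name` through wildcard imports by exact FQN match only.
/// Returns `None` when no package, or more than one package, provides the name:
/// the sketch prefers a missing edge over a guessed one.
pub fn resolve_wildcard(
    name: &str,
    wildcard_packages: &[&str],
    known_fqns: &HashSet<String>,
) -> Option<String> {
    let mut hits = wildcard_packages
        .iter()
        .map(|pkg| format!("{pkg}.{name}"))
        .filter(|fqn| known_fqns.contains(fqn));
    let first = hits.next()?;
    if hits.next().is_some() {
        None // ambiguous: two wildcard packages both define `name`
    } else {
        Some(first)
    }
}
```

No step falls back to matching on the bare name alone, which is the kind of fuzzy lookup this MR removes.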

Known gaps

Python FilePath import strategy still stubbed (relative imports)
self.db.query() instance attribute SSA across sibling methods not yet working
C# resolution rules not implemented (parse-only)
TypeScript/Ruby/Rust/Go v2 support not started

Edited Apr 15, 2026 by Michael Usachenko
