feat(code-graph): code graph construction pipeline with SSA-based generic resolver

What does this MR do and why?

Part of #455 (closed), #456 (closed), #457 (closed), #458 (closed). Cleanroom v2 code-graph pipeline that replaces V1's ~300 enum variants across ~20 enums with a fully generic, language-agnostic architecture. All code coexists alongside v1 via v2/ modules.

This MR is originally based on some exploratory MR(s): !885 (closed), !766 (closed), !767 (closed), !764 (closed) - which were themselves inspired by gitlab-code-parser#38, knowledge-graph#3.

**Note: This MR will be merged in its current form, as the linked issues this MR implements are highly coupled. But, v2/ is a parallel implementation that is not active in production yet, and will be heavily verified, and possibly refactored before eventually deprecating the original code-graph stack.


Declarative Parsing DSL

Hand-written canonical parsers (~2500 lines) replaced by declarative DslLanguage specs (~350 lines total). The DSL engine walks tree-sitter ASTs using rule tables — no imperative parsing code per language.

// Adding a new language: one file of declarative rules
impl DslLanguage for PythonDsl {
    fn scopes() -> Vec<ScopeRule> {
        vec![
            scope("class_definition", "Class")
                .def_kind(DefKind::Class)
                .metadata(metadata().super_types(ExtractList::Fn(python_super_types))),
            scope_fn("function_definition", classify_python_function)
                .def_kind(DefKind::Function)
                .metadata(metadata().decorators(ExtractList::Fn(python_decorators))),
        ]
    }
    fn refs() -> Vec<ReferenceRule> {
        vec![
            reference("call")
                .when(field_kind("function", &["attribute"]))
                .name_from(Extract::FieldChain(&["function", "attribute"])),
            reference("call").name_from(field("function")),
        ]
    }
    fn imports() -> Vec<ImportRule> {
        vec![
            import("import_statement").classify(classify).path_from(Extract::None)
                .multi(&["dotted_name"]).alias_child("aliased_import"),
            import("import_from_statement").classify(classify)
                .path_from(field("module_name")).multi(&["dotted_name", "identifier"])
                .alias_child("aliased_import"),
        ]
    }
    fn bindings() -> Vec<ParseBindingRule> {
        vec![
            parse_binding("assignment").name_from(field("left")).value_from(field("right")),
        ]
    }
}

DSL primitives:

  • ScopeRule: .when(pred), .name_from(extract), .def_kind(), .metadata(), .no_scope()
  • ReferenceRule: .when(pred), .name_from(extract)
  • ImportRule: .classify(), .multi(), .alias_child(), .split_last(), .path_from()
  • ParseBindingRule: .name_from(), .value_from(), .no_value()
  • MetadataRule: .super_types(), .return_type(), .decorators(), .receiver_type()
  • Extract: Default, None, Field, ChildOfKind, FieldChain, Declarator
  • ExtractList: ChildrenOfField, ChildrenOfKind, FieldSplit, Decorators, Fn
  • Pred: parent_is(), grandparent_is(), ancestor_is(), field_kind(), has_name()

Pipeline wiring:

register_v2_pipelines! {
    // Generic: DSL parser + SSA resolver
    Python  => GenericPipeline<DslParser<PythonDsl>, RulesResolver<PythonRules>>,
    Java    => GenericPipeline<DslParser<JavaDsl>, RulesResolver<JavaRules>>,
    Kotlin  => GenericPipeline<DslParser<KotlinDsl>, RulesResolver<KotlinRules>>,
    CSharp  => GenericPipeline<DslParser<CSharpDsl>, NoResolver>,
    // Custom: full control over parse + link
    Ruby    => crate::v2::custom::ruby::RubyPipeline,
}

SSA-Based Generic Resolver

Reference resolution uses the Braun et al. algorithm ("Simple and Efficient Construction of Static Single Assignment Form", CC 2013) for reaching-definition analysis. One generic engine drives all languages — per-language differences are declarative rule tables.

Source Code → DslParser → CanonicalResult (defs, imports, refs, bindings)


                     Scoped SSA Walker
                     (one block per function/class scope,
                      predecessor edges for scope nesting)

                     write_variable / read_variable / seal_block


                        SsaResolver (Braun et al.)
                        phi insertion at join points
                        trivial phi elimination


                     ReachingResolver + Import Strategies
                     (scope FQN walk, explicit import, wildcard,
                      same-package, same-file, global name)


                     petgraph CodeGraph with call edges

Key features:

  • Scoped walker: one SSA block per function/class scope, not one flat block per file. Bindings in different scopes can't collide.
  • Phi nodes at control flow merge points (if/else, try/except, for/while)
  • Trivial phi elimination (single-value phis collapsed on the fly)
  • EdgeSource::File for module-level calls (File→Definition edges)
  • CanonicalBinding with scope_fqn so the walker knows which block to write to

Declarative resolution rules:

ResolutionRules {
    scopes: vec![isolated_scope("class_definition", ScopeKind::Class), ...],
    branches: vec![branch("if_statement").branches(&["block", "else_clause"]).catch_all("else_clause"), ...],
    loops: vec![loop_rule("for_statement").iter_over("right"), ...],
    bindings: vec![binding("assignment", BindingKind::Assignment), ...],
    import_strategies: vec![ScopeFqnWalk, ExplicitImport, WildcardImport, SamePackage, ...],
    chain_mode: ChainMode::ValueFlow,  // Python: follow aliases
    receiver: ReceiverMode::Convention { ... },
}

Canonical Types

All language-specific types eliminated. One set of canonical types flows through the entire pipeline:

  • CanonicalParser trait (in code-graph-types, not parser-core)
  • CanonicalDefinition with DefinitionMetadata (super_types, return_type, decorators, receiver_type)
  • CanonicalReference with ExpressionStep chains (Ident, Field, Call, New, This, Super, Index)
  • CanonicalBinding with scope_fqn for scoped SSA resolution
  • CanonicalResult as the single boundary type between parser and linker
  • CodeGraph backed by petgraph::DiGraph<GraphNode, GraphEdge> with AsRecordBatch impls
  • Fqn, Range, DefKind, EdgeKind, NodeKind, ScopeIndex — all with PartialEq + Eq

Call Graph Validator (Cypher-based YAML test suites)

Graph validation using lance-graph Cypher queries against Arrow datasets. Each YAML file is a self-contained test suite with inline source fixtures and exact assertions.

name: "Python: intra-file resolution"
fixtures:
  - path: main.py
    content: |
      def option1(): pass
      def option2(): pass
      def caller():
          x = option1
          x = option2
          x()

tests:
  - name: "Reassignment: x() resolves to option2 (latest wins)"
    queries:
      - query: |
          MATCH (caller:Definition)-[:DefinitionToDefinition]->(callee:Definition)
          WHERE caller.name = 'caller' AND callee.name = 'option2'
          RETURN caller.fqn AS caller, callee.fqn AS callee
        assert:
          - { row_count: 1 }
          - { contains_row: { caller: "caller", callee: "option2" } }
      - query: |
          MATCH (caller:Definition)-[:DefinitionToDefinition]->(callee:Definition)
          WHERE caller.name = 'caller' AND callee.name = 'option1'
          RETURN callee.name AS wrong_target
        assert:
          - { empty: true }

Run locally: mise test:graph-validator

Assertion types: empty, non_empty, row_count, count_equals, count_gte, all_match (glob), contains_row (exact field match)

Test case features: skip, description, queries (multi-query per test), severity (error/warning)

35 YAML Cypher suites, all passing:

Category Suites What's tested
Structural 2 Containment hierarchy, file/dir edges, no self-refs
Python definitions 3 All def types, FQN nesting, lambdas, async, decorators, comprehensives
Python imports 4 From-import, cross-file, multi-file chains, C1-C11 extraction, aliased/partial
Python resolution 12 Simple call, self.method, reassignment, recursive, scope shadows, for-loop, comprehension, try/except, match/case, walrus, f-string, conditional bindings, chained/callable, multiple inheritance
Python inter-file 3 3-layer project, import chains, G1-G11 patterns, unused detection
Java 3 Intra-file (same-class, static), inter-file (cross-package, wildcard imports, FQN)
Kotlin 2 Intra-file (method calls, top-level, FQN), package scoping

Adding a new language

  1. Write a DslLanguage impl (~80 lines of declarative rules)
  2. Write a ResolutionRules struct (~30 lines of declarative config)
  3. Add one line to register_v2_pipelines!

No imperative code needed. No AST walker to write. No resolver to implement.

Edited by Michael Usachenko

Merge request reports

Loading