feat(code-graph): code graph construction pipeline with SSA-based generic resolver
What does this MR do and why?
Part of #455 (closed), #456 (closed), #457 (closed), #458 (closed). Cleanroom v2 code-graph pipeline that replaces V1's ~300 enum variants across ~20 enums with a fully generic, language-agnostic architecture. All code coexists alongside v1 via v2/ modules.
This MR is originally based on some exploratory MR(s): !885 (closed), !766 (closed), !767 (closed), !764 (closed) - which were themselves inspired by gitlab-code-parser#38, knowledge-graph#3.
**Note: This MR will be merged in its current form, as the linked issues this MR implements are highly coupled. But,
v2/is a parallel implementation that is not active in production yet, and will be heavily verified, and possibly refactored before eventually deprecating the originalcode-graphstack.
Declarative Parsing DSL
Hand-written canonical parsers (~2500 lines) replaced by declarative DslLanguage specs (~350 lines total). The DSL engine walks tree-sitter ASTs using rule tables — no imperative parsing code per language.
// Adding a new language: one file of declarative rules
impl DslLanguage for PythonDsl {
fn scopes() -> Vec<ScopeRule> {
vec![
scope("class_definition", "Class")
.def_kind(DefKind::Class)
.metadata(metadata().super_types(ExtractList::Fn(python_super_types))),
scope_fn("function_definition", classify_python_function)
.def_kind(DefKind::Function)
.metadata(metadata().decorators(ExtractList::Fn(python_decorators))),
]
}
fn refs() -> Vec<ReferenceRule> {
vec![
reference("call")
.when(field_kind("function", &["attribute"]))
.name_from(Extract::FieldChain(&["function", "attribute"])),
reference("call").name_from(field("function")),
]
}
fn imports() -> Vec<ImportRule> {
vec![
import("import_statement").classify(classify).path_from(Extract::None)
.multi(&["dotted_name"]).alias_child("aliased_import"),
import("import_from_statement").classify(classify)
.path_from(field("module_name")).multi(&["dotted_name", "identifier"])
.alias_child("aliased_import"),
]
}
fn bindings() -> Vec<ParseBindingRule> {
vec![
parse_binding("assignment").name_from(field("left")).value_from(field("right")),
]
}
}DSL primitives:
ScopeRule:.when(pred),.name_from(extract),.def_kind(),.metadata(),.no_scope()ReferenceRule:.when(pred),.name_from(extract)ImportRule:.classify(),.multi(),.alias_child(),.split_last(),.path_from()ParseBindingRule:.name_from(),.value_from(),.no_value()MetadataRule:.super_types(),.return_type(),.decorators(),.receiver_type()Extract:Default,None,Field,ChildOfKind,FieldChain,DeclaratorExtractList:ChildrenOfField,ChildrenOfKind,FieldSplit,Decorators,FnPred:parent_is(),grandparent_is(),ancestor_is(),field_kind(),has_name()
Pipeline wiring:
register_v2_pipelines! {
// Generic: DSL parser + SSA resolver
Python => GenericPipeline<DslParser<PythonDsl>, RulesResolver<PythonRules>>,
Java => GenericPipeline<DslParser<JavaDsl>, RulesResolver<JavaRules>>,
Kotlin => GenericPipeline<DslParser<KotlinDsl>, RulesResolver<KotlinRules>>,
CSharp => GenericPipeline<DslParser<CSharpDsl>, NoResolver>,
// Custom: full control over parse + link
Ruby => crate::v2::custom::ruby::RubyPipeline,
}
SSA-Based Generic Resolver
Reference resolution uses the Braun et al. algorithm ("Simple and Efficient Construction of Static Single Assignment Form", CC 2013) for reaching-definition analysis. One generic engine drives all languages — per-language differences are declarative rule tables.
Source Code → DslParser → CanonicalResult (defs, imports, refs, bindings)
│
▼
Scoped SSA Walker
(one block per function/class scope,
predecessor edges for scope nesting)
│
write_variable / read_variable / seal_block
│
▼
SsaResolver (Braun et al.)
phi insertion at join points
trivial phi elimination
│
▼
ReachingResolver + Import Strategies
(scope FQN walk, explicit import, wildcard,
same-package, same-file, global name)
│
▼
petgraph CodeGraph with call edgesKey features:
- Scoped walker: one SSA block per function/class scope, not one flat block per file. Bindings in different scopes can't collide.
- Phi nodes at control flow merge points (if/else, try/except, for/while)
- Trivial phi elimination (single-value phis collapsed on the fly)
EdgeSource::Filefor module-level calls (File→Definition edges)CanonicalBindingwithscope_fqnso the walker knows which block to write to
Declarative resolution rules:
ResolutionRules {
scopes: vec![isolated_scope("class_definition", ScopeKind::Class), ...],
branches: vec![branch("if_statement").branches(&["block", "else_clause"]).catch_all("else_clause"), ...],
loops: vec![loop_rule("for_statement").iter_over("right"), ...],
bindings: vec![binding("assignment", BindingKind::Assignment), ...],
import_strategies: vec![ScopeFqnWalk, ExplicitImport, WildcardImport, SamePackage, ...],
chain_mode: ChainMode::ValueFlow, // Python: follow aliases
receiver: ReceiverMode::Convention { ... },
}Canonical Types
All language-specific types eliminated. One set of canonical types flows through the entire pipeline:
CanonicalParsertrait (incode-graph-types, notparser-core)CanonicalDefinitionwithDefinitionMetadata(super_types, return_type, decorators, receiver_type)CanonicalReferencewithExpressionStepchains (Ident, Field, Call, New, This, Super, Index)CanonicalBindingwithscope_fqnfor scoped SSA resolutionCanonicalResultas the single boundary type between parser and linkerCodeGraphbacked bypetgraph::DiGraph<GraphNode, GraphEdge>withAsRecordBatchimplsFqn,Range,DefKind,EdgeKind,NodeKind,ScopeIndex— all withPartialEq + Eq
Call Graph Validator (Cypher-based YAML test suites)
Graph validation using lance-graph Cypher queries against Arrow datasets. Each YAML file is a self-contained test suite with inline source fixtures and exact assertions.
name: "Python: intra-file resolution"
fixtures:
- path: main.py
content: |
def option1(): pass
def option2(): pass
def caller():
x = option1
x = option2
x()
tests:
- name: "Reassignment: x() resolves to option2 (latest wins)"
queries:
- query: |
MATCH (caller:Definition)-[:DefinitionToDefinition]->(callee:Definition)
WHERE caller.name = 'caller' AND callee.name = 'option2'
RETURN caller.fqn AS caller, callee.fqn AS callee
assert:
- { row_count: 1 }
- { contains_row: { caller: "caller", callee: "option2" } }
- query: |
MATCH (caller:Definition)-[:DefinitionToDefinition]->(callee:Definition)
WHERE caller.name = 'caller' AND callee.name = 'option1'
RETURN callee.name AS wrong_target
assert:
- { empty: true }Run locally: mise test:graph-validator
Assertion types: empty, non_empty, row_count, count_equals, count_gte, all_match (glob), contains_row (exact field match)
Test case features: skip, description, queries (multi-query per test), severity (error/warning)
35 YAML Cypher suites, all passing:
| Category | Suites | What's tested |
|---|---|---|
| Structural | 2 | Containment hierarchy, file/dir edges, no self-refs |
| Python definitions | 3 | All def types, FQN nesting, lambdas, async, decorators, comprehensives |
| Python imports | 4 | From-import, cross-file, multi-file chains, C1-C11 extraction, aliased/partial |
| Python resolution | 12 | Simple call, self.method, reassignment, recursive, scope shadows, for-loop, comprehension, try/except, match/case, walrus, f-string, conditional bindings, chained/callable, multiple inheritance |
| Python inter-file | 3 | 3-layer project, import chains, G1-G11 patterns, unused detection |
| Java | 3 | Intra-file (same-class, static), inter-file (cross-package, wildcard imports, FQN) |
| Kotlin | 2 | Intra-file (method calls, top-level, FQN), package scoping |
Adding a new language
- Write a
DslLanguageimpl (~80 lines of declarative rules) - Write a
ResolutionRulesstruct (~30 lines of declarative config) - Add one line to
register_v2_pipelines!
No imperative code needed. No AST walker to write. No resolver to implement.