# Declarative DSL engine for tree-sitter parsing
## Problem to Solve

Each language's parser is an imperative tree-sitter walker written from scratch. The 7 current parsers total ~30,800 lines. Despite structural similarity (walk the AST, extract definitions at scope boundaries, extract references at call sites, extract imports), each one reimplements the same walk with different variable names:

```
Python parser:     6,715 lines (references.rs alone: 2,076 lines)
Ruby parser:       5,560 lines (definitions.rs: 1,111 lines)
TypeScript parser: 5,594 lines (4 SWC sub-modules)
Kotlin parser:     3,922 lines (ast.rs: 3,100 lines)
Rust parser:       4,161 lines (fqn.rs: 1,592 lines)
Java parser:       3,143 lines (ast.rs: 2,484 lines)
C# parser:         1,705 lines
```

The actual language-specific logic in each parser is small — which tree-sitter node kinds are scopes, which are references, how to extract import paths. The rest is tree-walking boilerplate.

## Proposed Solution

A declarative DSL where each language is a table of rules. The DSL engine walks the tree-sitter AST once and applies matching rules to produce a `CanonicalResult` directly.

```rust
impl DslLanguage for PythonDsl {
    fn scopes() -> Vec<ScopeRule> {
        vec![
            scope("class_definition", "Class")
                .def_kind(DefKind::Class)
                .metadata(metadata().super_types(ExtractList::Fn(python_super_types))),
            scope_fn("function_definition", classify_fn)
                .def_kind(DefKind::Function)
                .metadata(metadata().decorators(ExtractList::Fn(python_decorators))),
        ]
    }

    fn refs() -> Vec<ReferenceRule> {
        vec![
            reference("call")
                .when(field_kind("function", &["attribute"]))
                .name_from(Extract::FieldChain(&["function", "attribute"])),
            reference("call").name_from(field("function")),
        ]
    }

    fn imports() -> Vec<ImportRule> { /* ... */ }
    fn bindings() -> Vec<ParseBindingRule> { /* ... */ }
}
```

The Python spec is ~100 lines. The entire DSL engine is ~1,200 lines. Together they replace a 6,715-line hand-written parser.
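To make the "walk once, apply matching rules" idea concrete, here is a minimal self-contained sketch of the engine's core loop. It uses a toy `Node` type in place of real tree-sitter nodes, and the names `ScopeRule`, `Definition`, and `walk` are illustrative assumptions, not the actual engine API:

```rust
// Toy AST node standing in for a tree-sitter node (assumption for the sketch).
#[derive(Debug)]
struct Node {
    kind: &'static str,   // tree-sitter node kind, e.g. "class_definition"
    name: &'static str,   // extracted name (stands in for the Extract logic)
    children: Vec<Node>,
}

#[derive(Debug, PartialEq)]
struct Definition {
    fqn: String,          // fully-qualified name built from the scope stack
    def_kind: &'static str,
}

// One declarative rule: which node kind opens a scope and what it defines.
struct ScopeRule {
    node_kind: &'static str,
    def_kind: &'static str,
}

/// Walk the AST once; whenever a node matches a ScopeRule, emit a
/// Definition and push the node's name onto the scope stack so that
/// children are qualified by it.
fn walk(node: &Node, rules: &[ScopeRule], scope: &mut Vec<String>, out: &mut Vec<Definition>) {
    let matched = rules.iter().find(|r| r.node_kind == node.kind);
    if let Some(rule) = matched {
        let mut parts = scope.clone();
        parts.push(node.name.to_string());
        out.push(Definition { fqn: parts.join("."), def_kind: rule.def_kind });
        scope.push(node.name.to_string());
    }
    for child in &node.children {
        walk(child, rules, scope, out);
    }
    if matched.is_some() {
        scope.pop();
    }
}

fn main() {
    // Toy tree: class Foo containing method bar.
    let tree = Node {
        kind: "module", name: "",
        children: vec![Node {
            kind: "class_definition", name: "Foo",
            children: vec![Node { kind: "function_definition", name: "bar", children: vec![] }],
        }],
    };
    let rules = [
        ScopeRule { node_kind: "class_definition", def_kind: "Class" },
        ScopeRule { node_kind: "function_definition", def_kind: "Function" },
    ];
    let mut out = Vec::new();
    walk(&tree, &rules, &mut Vec::new(), &mut out);
    assert_eq!(out[0].fqn, "Foo");
    assert_eq!(out[1].fqn, "Foo.bar");
    println!("{:?}", out);
}
```

The real engine would also thread reference, import, and binding rules through the same single pass; the point of the sketch is that per-language code shrinks to the rule tables.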
**DSL primitives:**

| Concept | Builder | Purpose |
|---|---|---|
| `ScopeRule` | `.when()`, `.name_from()`, `.def_kind()`, `.metadata()`, `.no_scope()` | Match tree-sitter node kinds to definitions |
| `ReferenceRule` | `.when()`, `.name_from()` | Match call sites / usages |
| `ImportRule` | `.classify()`, `.multi()`, `.alias_child()`, `.split_last()`, `.path_from()` | Handle import syntax variations |
| `ParseBindingRule` | `.name_from()`, `.value_from()`, `.no_value()` | Extract variable assignments for SSA |
| `Extract` | `Default`, `None`, `Field`, `ChildOfKind`, `FieldChain`, `Declarator` | Pull text from a single node |
| `ExtractList` | `ChildrenOfField`, `ChildrenOfKind`, `FieldSplit`, `Decorators`, `Fn` | Pull text from multiple nodes |
| `Pred` | `parent_is()`, `field_kind()`, `ancestor_is()`, `has_name()` | Conditional rule application |

The DSL handles the hard parts of import parsing that differ across languages: Python's `from X import a, b, c` (`.multi()`), Java's `import com.example.Foo` (`.split_last()`), Python's `import X as Y` (`.alias_child()`), and wildcard vs. explicit classification (`.classify()`).

## Acceptance Criteria

- [ ] `LanguageSpec` struct with `ScopeRule`, `ReferenceRule`, `ImportRule`, `ParseBindingRule`
- [ ] `Extract` and `ExtractList` enums for flexible node text extraction
- [ ] `Pred` predicate system for conditional rules
- [ ] `DslParser<L: DslLanguage>` implementing `CanonicalParser`
- [ ] `LanguageSpec.package()` for namespace/package scope pushing
- [ ] `LanguageSpec.custom_import()` for complex import handling (e.g., Python `__future__`)
- [ ] DSL specs for Python (~100 lines), Java (~110 lines), Kotlin (~130 lines), C# (~80 lines)
- [ ] Parser output matches V1 for supported constructs (verified by graph validator suites)
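As an illustration of the import combinators listed under DSL primitives, a minimal sketch of their intended semantics. The function names and the `Import` shape are assumptions for the sake of the example, not the real builder API:

```rust
#[derive(Debug, PartialEq)]
struct Import {
    module: String,        // module/package part of the path
    name: String,          // symbol actually brought into scope
    alias: Option<String>, // local rename, if any
}

/// `.split_last()` semantics (assumed): a dotted Java-style path like
/// "com.example.Foo" splits into module "com.example" and name "Foo".
fn split_last(path: &str) -> Import {
    let (module, name) = path.rsplit_once('.').unwrap_or(("", path));
    Import { module: module.to_string(), name: name.to_string(), alias: None }
}

/// `.multi()` semantics (assumed): "from X import a, b, c" expands into
/// one Import per listed name, all sharing the same module.
fn multi(module: &str, names: &[&str]) -> Vec<Import> {
    names.iter()
        .map(|n| Import { module: module.to_string(), name: n.to_string(), alias: None })
        .collect()
}

/// `.alias_child()` semantics (assumed): "import X as Y" keeps the real
/// name X but records the local binding Y.
fn alias_child(name: &str, alias: &str) -> Import {
    Import { module: String::new(), name: name.to_string(), alias: Some(alias.to_string()) }
}

fn main() {
    assert_eq!(split_last("com.example.Foo").name, "Foo");
    assert_eq!(split_last("com.example.Foo").module, "com.example");
    assert_eq!(multi("os.path", &["join", "split"]).len(), 2);
    assert_eq!(alias_child("numpy", "np").alias.as_deref(), Some("np"));
}
```

In the real DSL these would be rule modifiers applied to matched tree-sitter nodes rather than free functions over strings, but the per-language variation they absorb is the same.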