[proposal] Spike: Resolving Definition FQNs with a ORM-like rules engine
Background
Since starting phase 1 of the Knowledge Graph project, we've already built 4 parsers (Ruby, JS, Python, Kotlin) that more or less, use the same pattern to resolve FQNs - but go about it via different implementation details.
As the One Parser project matures, and more languages are supported, a lot of code will end up duplicated for no good reason: increasing the probability of error(s), performance and best practices will drift between implementations, and a lot of subtle details will be created for future package maintainers to sift through.
Given the tree structure of ASTs, the correct way to construct FQNs is via DFS. You visit a particular node in the tree, like class A { function b(){} }
, create a new scope on class A
, and then visit all that scope's children function b(){}
, before terminating the parent scope. You'd end up with two FQN's in this case file.A
, and file.A.b
.
The logic for traversal is language-agnostic, and the rules for deciding whether or not a node in the tree creates a new scope either requires looking at the current node e.g class_definition
-> class A {}
, or at an ancestor node const foo = () => {}
-> parent_of("arrow_function") == "variable_declarator"
. These rules require language-specific inputs, but are themselves language-agnostic, as their "language" is the language of tree traversal.
Proposal
A DSL-like developer experience that substantially reduces the time it takes to resolve FQNs for Definitions by abstracting away most tree-traversal details, and leaving the core question of "Does this node create a new scope?" to a set of rules in a rule engine that always resolves to True
or False
.
For any given language, there would be two steps necessary to implement this experience:
- A corpus of scope creating node kinds e.g {
class_definition
,arrow_function
, ...} - A list of rules, e.g
ScopeRule[]
In addition, there would be some "hidden" assumptions baked into the execution of list of scope rules:
- If the list of scope rules is empty, then by default, any node kind in the corpus automatically creates a new scope.
- If a node kind is in the corpus, but doesn't have a corresponding
ScopeRule
, it will automatically create a scope. - If a scope rule deeper in the rules list contradicts another, e.g
rules[0]
is contradicted byrules[5]
because they both apply a rule toarrow_function
, thenrules[5]
automatically takes precedence. - If a rule evaluates to
True
, then a "name extraction" callback will be automatically applied to yield thename
portion of for thatFQNPart
. - If a rule evaluates to
True
, then a "range extraction" callback will be automatically applied to yield theRange
portion for thatFQNPart
. This is important in edge cases like including the decorator line in a function FQN for languages like Python.
For each scope rule the interior logic would look something like this:
Simple example:
ScopeRule {
predicate: Box::new(
any(vec!["class_declaration", "method_definition", "function_declaration", "generator_function_declaration"])
),
// name_extractor doesn't need to be included (applied by default)
// range_extractor doesn't need to be included (applied by default)
}
More advanced example:
ScopeRule {
predicate: Box::new(
any(vec!["class", "arrow_function", "function_expression", "generator_function"])
.and(parent_field_any(vec![
("variable_declarator", "name"),
("field_definition", "property"),
]))
),
name_extractor: Box::new(
extract_from_parent_fields(vec![
("variable_declarator", "name"),
("field_definition", "property"),
])
),
}
These two examples cover 100% of the test cases for the merged JS definitions parser. In fact, you wouldn't even need the simple example at all, because by default, node kinds in the corpus will create new scopes.
Implications
When considering the future of the One Parser project - our team has already discussed unifying the "boring stuff" like YAML loading, etc for language-specific parser config, and I believe this is the final step in the direction that makes new language implementations a single file.
As discussed earlier in the issue, the vast majority of scoping cases will be covered by the logical progression of:
- Evaluate a predicate, and determine whether or not a new scope should be created.
- If a new scope is created, extract the
name
for theFQNPart
. - If a new scope is created, extract the
Range
for theFQNPart
.
All 3 parts are configurable, and for the 2) and 3) - you'd be able to set a default "evaluation" to cover simple name and range assignments by default.
There are, of course, lingering questions (in no particular order) about:
- Performance
- Expanding to other FQN flavors like Imports
- Removing the current
DefinitionExtractor
pattern in favor of theScopeRule[]
pattern -
Removing
ast-grep
all together - The "right" standard library (helpers like
parent_field_any
,extract_from_parent_fields
, etc) - Any edge cases where this fails, so we can iterate on the spec
- Usage of this design pattern in the linking phase of the KG Indexer
Next Steps
I'd like to collect some feedback on this, but I'm going to whip up a very crude exploratory MR for this shortly. We'll quickly know it's a good pattern if we can replace multiple definition parsers in one go with this engine.