Proposal: Linking references to definitions should always use global backtracking as an initial pass
In discussions with @michaelangeloio and @jshobrook1, we have discussed various approaches to linking references to definitions. There are many ways to tackle this problem. The "correct" way would be to build an import graph for a given language, and traverse it. An import graph, meaning, a language specific trace of how imported definitions come into scope, to be referenced.
Consider the following Python code:
# utils/stuff.py
def foo():
pass
# utils/__init__.py
from utils.stuff import foo
# main.py
from utils import foo
foo()
Without an import graph, the FQN of foo
would be utils.foo
, which is incorrect. The true location of foo
is utils.stuff.foo
. This can only be done by tracing the import back to it's original location.
While this is not particularly difficult for Python, it does require language specific knowledge of how imports are resolved, and implementation difficultly will vary by language, especially if there is runtime aliasing/scope resolution, like JS permits.
Global backtracking is a great "quick and dirty" solution that is language-agnostic, and can easily precede a language specific import crawler.
How Global Backtracking Works & Benefits
In our example, foo()
is called in main.py
. If you iterate through all of the definition names in a particular codebase, and find that def foo
is globally unique, then you can confidently say that the FQN of foo
will necessarily be utils.stuff.foo
. You can entirely avoid building an import graph this way, until you need to. Global backtracking gives you homogenous starting point for all language-specific linkers, and permits a partial callgraph for a language to be shipped out of the box.
It is also important to note that global backtracking does not create a complete callgraph, only an partial one, but partial is shippable, and will still cover a non-trivial portion of a codebase. You can also global backtracking to create an approximate callgraph, but more on that later.
Handling Edge Cases
This is non-exhaustive, but if foo
is aliased, or we have a third-party import that also has a definition called foo
- we would need to account for that during the backtracking step. For the latter example, we could either a) check the file(s) which call foo
and see if their import location is in the namespace from pypipackage import foo
vs. from utils import foo
, or b) mark foo
as a ambiguous reference and exclude it from the partial callgraph. In general, we want to make sure that the probability of false edges are minimized.
Handling Non-Unique Definitions
So far, we have only discussed completely unique definitions e.g foo
in our example code. What if foo
is not globally unique, and has multiple definitions in a codebase? There are a few equally valid approaches here. We could either skip the definition entirely, or mark the reference to foo
as ambiguous, and subject the reference to additional language-agnostic heuristic passes to attempt to disambiguate it. A simple example is argument matching: if we have def foo()
and def foo(a,b)
and foo()
is called, we can assume with high confidence that def foo()
is the function being called.
Proposal Option 1: Global Backtracking for Partial, Correct* Callgraph
In this option, global backtracking would only permit references to be linked to definitions that we know are 100% globally unique e.g are unaliased, and don't have any ambiguities in the import namespace, among other edge cases. For all other references, e.g ones that aren't globally unique, we'd resolve them with a language-specific import graph. The role of the import graph in option one would be to add edges.
*: There's still a risk of false edges here. We won't have absolute confidence that an imported reference belongs to the local import namespace or a 3rd-party dependency - without applying some import heuristics.
Proposal Option 2: Global Backtracking for a Complete, Approximate Callgraph
In this option, global backtracking would permit both references to defns that are 100% globally unique, and also include references that aren't globally unique, and are ambiguous. Those references would be marked in memory as such, would be subjected to follow heuristic passes, that are still language agnostic, like argument matching. To note, the heuristic passes do have a small probability of creating false edges, depending on the language. After the heuristic passes, we'd use the import graph to perform any remaining disambiguations. The role of the import graph in option two would be to remove edges.
Outcome
The best benefit I see from global backtracking is that we push back language specific work for resolving references as far back as possible, and all implementations of language-specific linkers start from the same place, either with a) a partial and correct callgraph or b) a complete, but approximate callgraph.
Starting from the same place has another benefit (quoting myself earlier in the issue):
"[it] permits a partial callgraph for a language to be shipped out of the box"
This means that as soon as a language implementation is made in the One Parser, it can immediately make it's way to production without needing to do any language-specific linking work. Of course, the caveat being that a partial (or approximate) callgraph will necessarily either have gaps in retrieval, or retrieve the wrong things.