Proposal: 5+ phases of work for the knowledge graph
Phase 1: Definitions and imports only
Let's start by rolling out a minimally viable version of the graph. It should contain four entity types:
- Directories
- Files
- Imports
- Definitions
The graph will include only structural relationships:
- Directory -> child files/subdirectories
- File -> definitions it contains
- File -> symbols it imports
- Definition -> sub-definitions it contains (e.g. nested functions, methods in a class)
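As a rough illustration of the entity types and structural relationships listed above, here is a minimal in-memory sketch. The names `NodeKind`, `EdgeKind`, `Node`, and `Graph` are hypothetical and chosen for this example only; they do not describe the actual schema or storage layer.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    DIRECTORY = "directory"
    FILE = "file"
    IMPORT = "import"
    DEFINITION = "definition"


class EdgeKind(Enum):
    DIR_CONTAINS = "dir_contains"    # Directory -> child file/subdirectory
    FILE_DEFINES = "file_defines"    # File -> definition it contains
    FILE_IMPORTS = "file_imports"    # File -> symbol it imports
    DEF_CONTAINS = "def_contains"    # Definition -> sub-definition


@dataclass(frozen=True)
class Node:
    kind: NodeKind
    name: str  # a path for directories and files, a symbol name otherwise


@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def link(self, source: Node, kind: EdgeKind, target: Node) -> None:
        """Register both endpoints and the structural edge between them."""
        self.nodes.update((source, target))
        self.edges.add((source, kind, target))


graph = Graph()
src = Node(NodeKind.DIRECTORY, "src")
app = Node(NodeKind.FILE, "src/app.py")
os_import = Node(NodeKind.IMPORT, "os.path.join")
server = Node(NodeKind.DEFINITION, "Server")
start = Node(NodeKind.DEFINITION, "Server.start")

graph.link(src, EdgeKind.DIR_CONTAINS, app)
graph.link(app, EdgeKind.FILE_IMPORTS, os_import)
graph.link(app, EdgeKind.FILE_DEFINES, server)
graph.link(server, EdgeKind.DEF_CONTAINS, start)
```

Note that no edge kind for references exists yet; that is the part phases 2 and 3 would add.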
We will ignore references in this phase. While a graph without references isn't very useful for RAG, it provides a few immediate benefits:
- We ship something sooner, which gives us a base to start iterating on.
- Establishes a pattern for how languages will be supported by the knowledge graph.
- Once a pattern is set, we can parallelize more work (e.g. language support) across team members.
Notably, what we consider a definition depends on the language. In general, we'd like to treat all callable objects with names as definitions (e.g. functions and classes), but there may be exceptions for non-callable objects (e.g. type definitions in TypeScript) depending on the language. More details on this in gitlab-code-parser#29 (closed).
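To make "named callable objects as definitions" concrete for one language, the sketch below uses Python's standard `ast` module to collect functions and classes together with their containing definition. This is only an illustration under the assumption that Python definitions are limited to `def`, `async def`, and `class`; the helper name `collect_definitions` is hypothetical, and the real parser may work quite differently.

```python
import ast

SOURCE = '''
class Server:
    def start(self):
        def helper():
            pass

def main():
    pass
'''

# Assumption for this sketch: only these node types count as definitions.
DEFINITION_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def collect_definitions(node, parent=None):
    """Return (parent_name, qualified_name) pairs for every named callable."""
    found = []
    for child in ast.iter_child_nodes(node):
        if isinstance(child, DEFINITION_TYPES):
            qualified = f"{parent}.{child.name}" if parent else child.name
            found.append((parent, qualified))
            # Recurse to pick up methods and nested functions.
            found.extend(collect_definitions(child, qualified))
        else:
            # Definitions may also sit inside if/try/with blocks.
            found.extend(collect_definitions(child, parent))
    return found


definitions = collect_definitions(ast.parse(SOURCE))
```

A pair whose parent is `None` would map to a File -> definition edge, and any other pair to a Definition -> sub-definition edge.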
Phase 2: Same-file references
The next step is to parse all the references from a repository and use them to link together definition nodes in the graph. For this phase, we only link nodes within the same file. References to functions imported from other files are ignored here; they will be resolved in phase 3.
Notably, we are defining a reference as a function call, but we will extend this definition in a later phase of work. The goal of this phase, roughly speaking, is to achieve parity with the GitHub code navigator (and/or LSP in our IDEs) for same-file references.
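A minimal sketch of same-file reference linking, again using Python's `ast` module for illustration. It assumes a reference is a plain call by name (`foo()`), considers only top-level functions, and ignores attribute calls such as `obj.foo()`; the function name `same_file_references` is hypothetical.

```python
import ast

SOURCE = '''
def load():
    return parse()

def parse():
    return external_call()

def main():
    load()
    parse()
'''


def same_file_references(source):
    """Return sorted (caller, callee) pairs where both are defined in this file."""
    tree = ast.parse(source)
    functions = [
        node for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    defined = {func.name for func in functions}
    edges = set()
    for func in functions:
        for node in ast.walk(func):
            is_named_call = isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            # Calls to names not defined in this file are skipped in this phase.
            if is_named_call and node.func.id in defined:
                edges.add((func.name, node.func.id))
    return sorted(edges)


references = same_file_references(SOURCE)
```

The call to `external_call` produces no edge, since its definition is not in the file; that kind of reference is what phase 3 would pick up.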
Phase 3: Cross-file references
The parser will capture all references in a file, but some of those will be references to imported functions rather than functions defined in the same file. We need to resolve those references and link definitions across files.
The parser will already provide the imported symbols for every file, along with their fully qualified names (FQNs). The indexer will use the imports to locate the original definition for a cross-file reference and create a link. How this works will vary language by language, e.g. here's an overview of how it might work in Python.
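The lookup described above could be sketched roughly as follows. The three dictionaries stand in for parser and indexer output; their shapes, and the name `resolve_cross_file`, are assumptions made for this example rather than the actual interfaces.

```python
# Hypothetical indexer state: every known definition keyed by its FQN.
definitions_by_fqn = {
    "pkg.utils.slugify": "pkg/utils.py",
    "pkg.models.User": "pkg/models.py",
}

# Hypothetical parser output: per file, local name -> FQN of the imported symbol.
imports_by_file = {
    "pkg/views.py": {
        "slugify": "pkg.utils.slugify",
        "U": "pkg.models.User",        # e.g. `from pkg.models import User as U`
        "json_dumps": "json.dumps",    # defined outside the repository
    },
}

# Hypothetical parser output: per file, names referenced but not defined locally.
references_by_file = {
    "pkg/views.py": ["slugify", "U", "json_dumps", "render"],
}


def resolve_cross_file(file_path):
    """Split a file's references into resolved links and unresolved names."""
    links, unresolved = [], []
    imports = imports_by_file.get(file_path, {})
    for name in references_by_file.get(file_path, []):
        fqn = imports.get(name)
        if fqn in definitions_by_fqn:
            links.append((file_path, name, definitions_by_fqn[fqn], fqn))
        else:
            # Either not imported at all, or imported from outside the repo.
            unresolved.append(name)
    return links, unresolved


links, unresolved = resolve_cross_file("pkg/views.py")
```

In this sketch `json_dumps` stays unresolved because its definition lives in a dependency rather than in the repository.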
Alternatively, to get this out the door faster, we can use a language-agnostic approach that doesn't involve import resolution to compute cross-file references. This approach is a lot easier to implement, but it will result in a less accurate graph, so we should eventually do import resolution in a later phase of work.
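The proposal does not spell out what the language-agnostic approach is. One plausible reading, assumed here purely for illustration, is matching references to definitions by bare name across the whole repository. The sketch below (with the hypothetical helper `link_by_name`) shows why that would be simpler but less accurate: a name defined in two files yields two candidate links.

```python
from collections import defaultdict

# Hypothetical parser output: (file, name) pairs.
definitions = [("a.py", "parse"), ("b.py", "parse"), ("b.py", "render")]
references = [("main.py", "parse"), ("main.py", "render"), ("main.py", "missing")]


def link_by_name(definitions, references):
    """Link each reference to every definition in another file with the same name."""
    index = defaultdict(list)
    for file_path, name in definitions:
        index[name].append(file_path)

    edges = []
    for ref_file, name in references:
        for def_file in index.get(name, []):
            if def_file != ref_file:  # same-file links belong to phase 2
                edges.append((ref_file, name, def_file))
    return edges


edges = link_by_name(definitions, references)
```

Here the reference to `parse` links to both `a.py` and `b.py`, an ambiguity that import resolution would remove.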
Phase 4: Dependency resolution
Read more about this step here.
Phase 5: Extending definitions and references beyond callable objects
The ideal version of the graph treats every statement that binds an object to a name, even a variable assignment, as a definition, and every occurrence of a defined name, even if it's not a call, as a reference. This is the graph that LSPs enable, and it's what we should work towards. More details on how and why can be found in gitlab-code-parser#29 (closed) and gitlab-code-parser#17.
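To illustrate the broader notion of definitions and references, the sketch below treats any name binding in a Python file (function, class, parameter, or assignment target) as a definition and any read of a bound name as a reference. It relies on the `Store` and `Load` contexts that Python's `ast` module attaches to `Name` nodes. Scoping is deliberately ignored, and the helper name `bindings_and_references` is hypothetical.

```python
import ast

SOURCE = '''
LIMIT = 10

def clamp(value):
    return min(value, LIMIT)

handler = clamp
'''


def bindings_and_references(source):
    """Return (bound names, referenced names that are bound in this file)."""
    bound, loaded = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)            # function parameters bind names too
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)         # assignment target
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)        # any read, call or not
    # Drop names never bound in this file, such as builtins.
    return bound, {name for name in loaded if name in bound}


bound, referenced = bindings_and_references(SOURCE)
```

In this example `handler = clamp` counts as a reference to `clamp` even though nothing is called, which a call-only definition of references would miss.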