[Rust Code Parser] Parse and chunk files into logical code elements
Context
For an overview of the Code Parsing and Chunking Strategy, please refer to #528770 (comment 2451039319).
In this issue:
- we need to introduce the code parsing function that chunks code files into logical code elements
- the parser should chunk the files into top-level code segments using tree-sitter
- the parser will be introduced in a new Rust-based project called gitlab-code-parser
I/O Contract
The Go-based gitlab-elasticsearch-indexer will call the parser using a Code Parsing Chunker class.
As part of this issue, we need to finalize the I/O contract between the Code Parsing Chunker and the Rust Code Parser. The main things to consider are:
- we need to use a data structure that's most performant for FFI communications
- For the input (Code Parsing Chunker -> Rust Code Parser), this should be an array of files, with each file having the following fields:
  - file_path
  - file_content
- For the output (Rust Code Parser -> Code Parsing Chunker), this should be an array of chunks, with each chunk having the following fields:
  - file_path
  - content_type (e.g. method|class|import|etc.)
  - content_name (e.g. ModuleName::ClassName::method_name)
  - language
  - start_byte
  - end_byte
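To make the field lists above concrete, here is a minimal Rust sketch of the two record shapes. The struct names FileInput and Chunk, the function parse_files, the use of owned String fields, and the placeholder values "file" and "unknown" are all assumptions for illustration; they are not part of the agreed contract, and the FFI-friendly representation is still to be decided in this issue. The stand-in parser below does no real parsing and emits one chunk spanning each whole file.

```rust
/// One input file sent from the Code Parsing Chunker to the Rust Code Parser.
/// (Hypothetical name; the fields mirror the input list in this issue.)
#[derive(Debug, Clone, PartialEq)]
pub struct FileInput {
    pub file_path: String,
    pub file_content: String,
}

/// One chunk returned from the Rust Code Parser to the Code Parsing Chunker.
/// (Hypothetical name; the fields mirror the output list in this issue.)
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub file_path: String,
    /// e.g. "method", "class", "import"
    pub content_type: String,
    /// e.g. "ModuleName::ClassName::method_name"
    pub content_name: String,
    pub language: String,
    /// Byte offset where the chunk starts in file_content.
    pub start_byte: usize,
    /// Byte offset where the chunk ends (assumed exclusive in this sketch).
    pub end_byte: usize,
}

/// Placeholder for the real tree-sitter based parser: returns a single
/// chunk per file that covers the entire file content.
pub fn parse_files(files: &[FileInput]) -> Vec<Chunk> {
    files
        .iter()
        .map(|f| Chunk {
            file_path: f.file_path.clone(),
            content_type: "file".to_string(), // placeholder value
            content_name: f.file_path.clone(),
            language: "unknown".to_string(), // placeholder value
            start_byte: 0,
            end_byte: f.file_content.len(),
        })
        .collect()
}
```

Whether these records cross the FFI boundary as C-compatible structs, a serialized buffer, or something else is exactly the open question in the first consideration above.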
Note that the Rust Code Parser doesn't need to return the content of the chunks. The Code Parsing Chunker class in the Go Elasticsearch Indexer can determine this from the file content and the start_byte and end_byte of the chunks.
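Recovering the content from the byte offsets can be sketched as a single range lookup. The actual lookup would live in the Go indexer; it is shown in Rust here only to illustrate the idea. The function name chunk_content is hypothetical, and the sketch assumes end_byte is exclusive and that file_content is valid UTF-8.

```rust
/// Returns the chunk text for the byte range [start_byte, end_byte),
/// or None if the range is out of bounds, reversed, or does not fall
/// on UTF-8 character boundaries.
pub fn chunk_content(file_content: &str, start_byte: usize, end_byte: usize) -> Option<&str> {
    file_content.get(start_byte..end_byte)
}
```

For example, with the content "class Foo\n  def bar; end\nend\n", the range 12..24 yields "def bar; end". Returning None instead of panicking keeps a malformed offset from crashing the caller.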
Prerequisites
- The gitlab-code-parser must be created, see: #534153 (comment 2440502912) and #536077 (closed)
References
Resources:
- Proposal: Create "One Parser" - A Unified Stati... (#534153 - closed)
- Code Parsing and Chunking Strategy proposal
- Rust Tree Sitter documentation
- ast-grep - this allows you to introduce a "polyglot" query that will apply to all languages
  - example: playground link
  - for more information, please check in with @michaelangeloio
- Experiments for using Go + Rust + Tree-Sitter:
Reference Persons
@michaelangeloio
@partiaga
Proposal
TBA