[Rust Code Parser] Parse and chunk files into logical code elements

Context

For an overview of the Code Parsing and Chunking Strategy, please refer to #528770 (comment 2451039319).

In this issue:

  • We need to introduce the code-parsing function that chunks code files into logical code elements.
  • The parser should chunk each file into top-level code segments using tree-sitter.
  • The parser will be introduced in a new Rust-based project called gitlab-code-parser.

I/O Contract

The Go-based gitlab-elasticsearch-indexer will call the parser using a Code Parsing Chunker class.

As part of this issue, we need to finalize the I/O contract between the Code Parsing Chunker and the Rust Code Parser. The main things to consider are:

  • We need to use the data structure that performs best for FFI communication.
  • For the input (Code Parsing Chunker -> Rust Code Parser), this should be an array of files, with each file having the following fields:
    • file_path
    • file_content
  • For the output (Rust Code Parser -> Code Parsing Chunker), this should be an array of chunks, with each chunk having the following fields:
    • file_path
    • content_type (e.g. method, class, import)
    • content_name (e.g.: ModuleName::ClassName::method_name)
    • language
    • start_byte
    • end_byte
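
As a rough illustration of the contract above, the two record shapes could be written as plain Rust structs. This is only a sketch of the logical fields: the actual FFI representation is still to be decided in this issue, and the type names `FileInput` and `Chunk`, the helper `chunk_fits_file`, and the half-open reading of `start_byte`/`end_byte` are assumptions made here for illustration.

```rust
/// One input file passed from the Code Parsing Chunker to the parser.
/// (Hypothetical type name; the fields come from the contract above.)
#[derive(Debug, Clone, PartialEq)]
pub struct FileInput {
    pub file_path: String,
    pub file_content: String,
}

/// One logical code element returned by the parser.
/// (Hypothetical type name; the fields come from the contract above.)
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub file_path: String,
    pub content_type: String, // e.g. "method", "class", "import"
    pub content_name: String, // e.g. "ModuleName::ClassName::method_name"
    pub language: String,
    pub start_byte: usize, // assumed inclusive
    pub end_byte: usize,   // assumed exclusive
}

/// Hypothetical sanity check: the chunk refers to `file` and its byte
/// range is well-ordered and lies inside the file's content.
pub fn chunk_fits_file(chunk: &Chunk, file: &FileInput) -> bool {
    chunk.file_path == file.file_path
        && chunk.start_byte <= chunk.end_byte
        && chunk.end_byte <= file.file_content.len()
}
```

A check like `chunk_fits_file` could run on either side of the boundary to catch mismatched offsets early, since the chunk carries no content of its own.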

Note that the Rust Code Parser doesn't need to return the content of the chunks. The Code Parsing Chunker class in gitlab-elasticsearch-indexer can derive it from the file content and each chunk's start_byte and end_byte.
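
To show why the offsets are enough, here is a minimal sketch of recovering a chunk's content from the file content and a byte range. It is written in Rust for consistency with the parser project; the Go caller would do the equivalent slice on its own byte slice. The function name `chunk_content` and the half-open `[start_byte, end_byte)` convention are assumptions, not part of the agreed contract.

```rust
/// Returns the bytes of `file_content` in the half-open range
/// [start_byte, end_byte), or `None` if the range is inverted or
/// runs past the end of the file. (Hypothetical helper.)
pub fn chunk_content(
    file_content: &[u8],
    start_byte: usize,
    end_byte: usize,
) -> Option<&[u8]> {
    // `slice::get` with a range performs the bounds check for us,
    // so a bad offset yields `None` instead of a panic.
    file_content.get(start_byte..end_byte)
}
```

Working on raw bytes rather than `&str` means an offset that lands inside a multi-byte UTF-8 character cannot cause a panic; the caller can decide how to decode the result.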

Prerequisites

References

Resources:

Reference Persons

  • @michaelangeloio
  • @partiaga

Proposal

TBA
