[Go Indexer] Introduce the Code Parsing Chunker
Context
(For an overview of the Code Parsing and Chunking Strategy, please refer to #528770 (comment 2451039319).)
Currently, the gitlab-elasticsearch-indexer uses a SizeChunker that chunks code content by size. In this issue, we will replace it with the Code Chunker in the gitlab-code-parser, which exposes a Go package that can be used in the gitlab-elasticsearch-indexer.
Current Process Flow
- gitlab-elasticsearch-indexer has a chunk mode indexer implemented in internal/mode/chunk/chunk.go
- in internal/mode/chunk/chunk.go, we create the SizeChunker and pass it to the ChunkIndexer
- we start indexing by calling ChunkIndexer.PerformIndexing
- ChunkIndexer.PerformIndexing fetches the files from Gitaly and passes them to a BulkDelete and a BulkIndex function
- BulkIndex uses the current chunker (a SizeChunker) to chunk the files
- BulkIndex then passes the chunks to the vectorStoreIndexer to be indexed in the current vector store (Elasticsearch)
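The last two steps of this flow can be sketched roughly as follows. This is a minimal illustration, not the indexer's actual code: the Chunk fields, the Chunker interface, the File type, the bulkIndex helper, and the store callback are all hypothetical names chosen for the example, and the real SizeChunker may split content differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Chunk is a hypothetical stand-in for the indexer's Chunk type.
type Chunk struct {
	Path      string
	Content   string
	StartByte int
}

// Chunker is an assumed interface that any chunker would satisfy.
type Chunker interface {
	ChunkFile(path string, content []byte) ([]Chunk, error)
}

// SizeChunker splits content into pieces of at most MaxBytes bytes.
// Note: a plain byte split can cut a multi-byte UTF-8 character in half.
type SizeChunker struct{ MaxBytes int }

func (s SizeChunker) ChunkFile(path string, content []byte) ([]Chunk, error) {
	if s.MaxBytes <= 0 {
		return nil, errors.New("size chunker: MaxBytes must be positive")
	}
	var chunks []Chunk
	for start := 0; start < len(content); start += s.MaxBytes {
		end := start + s.MaxBytes
		if end > len(content) {
			end = len(content)
		}
		chunks = append(chunks, Chunk{
			Path:      path,
			Content:   string(content[start:end]),
			StartByte: start,
		})
	}
	return chunks, nil
}

// File is a hypothetical stand-in for a blob fetched from Gitaly.
type File struct {
	Path    string
	Content []byte
}

// bulkIndex chunks each file and hands the chunks to store, which stands in
// for the vector store indexer.
func bulkIndex(c Chunker, files []File, store func([]Chunk) error) error {
	for _, f := range files {
		chunks, err := c.ChunkFile(f.Path, f.Content)
		if err != nil {
			return fmt.Errorf("chunking %s: %w", f.Path, err)
		}
		if err := store(chunks); err != nil {
			return fmt.Errorf("indexing %s: %w", f.Path, err)
		}
	}
	return nil
}

func main() {
	files := []File{{Path: "a.go", Content: []byte("abcdefghij")}}
	total := 0
	err := bulkIndex(SizeChunker{MaxBytes: 4}, files, func(cs []Chunk) error {
		total += len(cs)
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("chunks indexed:", total)
}
```

Because bulkIndex depends only on the Chunker interface in this sketch, swapping the chunker does not require touching the indexing step.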
Prerequisites
- the Go-based gitlab-elasticsearch-indexer should introduce a "chunking" mode, see #536209 (closed)
- the gitlab-code-parser should expose its Code Chunker as a Go package, see gitlab-org/rust/gitlab-code-parser!85 (merged)
References
gitlab-elasticsearch-indexer
Running the
Planning discussions
- Code Parsing and Chunking Strategy proposal
- Code Embeddings blueprint
- Please refer to the outcome in #536142 (closed)
- Please refer to the Indexer & Chunker contract defined in #528770 (comment 2451039319)
Proposal
- Introduce the gitlab-code-parser's Code Chunker into the gitlab-elasticsearch-indexer. This should be wrapped as a sub-package in the chunker package, similar to the SizeChunker
- In step 2 of the process flow defined above, replace the creation of the SizeChunker with creating the new Code Chunker
- Ensure that the Code Chunker outputs all the same Chunk fields as the SizeChunker, defined in the parent chunker package
- Ensure that the output of the Code Chunker can be used in the call to the vectorStoreIndexer
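One way the wrapper could look is sketched below. Everything here is an assumption made for illustration: the Chunk fields, the Chunker interface, the parserChunk shape, and the parseFunc type are hypothetical, and fakeParse (which just splits on blank lines) only stands in for the gitlab-code-parser Go package, whose real API is not shown in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Chunk is a hypothetical stand-in for the Chunk type in the parent chunker package.
type Chunk struct {
	Path      string
	Content   string
	StartByte int
}

// Chunker is an assumed interface shared by the SizeChunker and the Code Chunker.
type Chunker interface {
	ChunkFile(path string, content []byte) ([]Chunk, error)
}

// parserChunk is a hypothetical shape for what the code parser returns.
type parserChunk struct {
	Text   string
	Offset int
}

// parseFunc stands in for the gitlab-code-parser entry point (assumed signature).
type parseFunc func(path string, src []byte) ([]parserChunk, error)

// CodeChunker adapts parser output to the indexer's Chunk type.
type CodeChunker struct{ parse parseFunc }

// Compile-time check that CodeChunker can replace any other Chunker.
var _ Chunker = (*CodeChunker)(nil)

func (c *CodeChunker) ChunkFile(path string, content []byte) ([]Chunk, error) {
	if c.parse == nil {
		return nil, errors.New("code chunker: no parser configured")
	}
	parsed, err := c.parse(path, content)
	if err != nil {
		return nil, fmt.Errorf("code chunker: %s: %w", path, err)
	}
	chunks := make([]Chunk, 0, len(parsed))
	for _, p := range parsed {
		// Populate every field the size-based chunker would populate.
		chunks = append(chunks, Chunk{Path: path, Content: p.Text, StartByte: p.Offset})
	}
	return chunks, nil
}

// fakeParse is a placeholder that splits on blank lines. It is NOT how the
// real code parser chunks; it only makes the sketch runnable.
func fakeParse(_ string, src []byte) ([]parserChunk, error) {
	var out []parserChunk
	offset := 0
	for _, part := range strings.Split(string(src), "\n\n") {
		if strings.TrimSpace(part) != "" {
			out = append(out, parserChunk{Text: part, Offset: offset})
		}
		offset += len(part) + 2 // account for the "\n\n" separator
	}
	return out, nil
}

func main() {
	var chunker Chunker = &CodeChunker{parse: fakeParse}
	chunks, err := chunker.ChunkFile("a.go", []byte("func a() {}\n\nfunc b() {}\n"))
	if err != nil {
		panic(err)
	}
	for _, ch := range chunks {
		fmt.Printf("%s@%d: %q\n", ch.Path, ch.StartByte, ch.Content)
	}
}
```

Keeping the translation from parser output to Chunk inside the wrapper means the vectorStoreIndexer call would see the same type regardless of which chunker produced it.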