[Go Indexer] Introduce the Code Parsing Chunker
Context
(For an overview of the Code Parsing and Chunking Strategy, please refer to #528770 (comment 2451039319).)
Currently, the gitlab-elasticsearch-indexer uses a SizeChunker that chunks code content by size. In this issue, we will replace it with the Code Chunker from the gitlab-code-parser, which exposes a Go package that can be used in the gitlab-elasticsearch-indexer.
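For illustration, a size-based chunker can be sketched as below. This is not the indexer's actual SizeChunker: the Chunk fields, the chunkBySize function and the byte-offset cutting rule are all assumptions made for this example.

```go
package main

import "fmt"

// Chunk is a hypothetical, simplified chunk record for this sketch.
type Chunk struct {
	Path      string
	Content   string
	StartByte int
}

// chunkBySize splits content into pieces of at most maxBytes bytes.
// It cuts on raw byte offsets, so a boundary can land in the middle of
// a declaration (or even a multi-byte UTF-8 character).
func chunkBySize(path, content string, maxBytes int) []Chunk {
	if maxBytes <= 0 {
		return nil
	}
	var chunks []Chunk
	for start := 0; start < len(content); start += maxBytes {
		end := start + maxBytes
		if end > len(content) {
			end = len(content)
		}
		chunks = append(chunks, Chunk{
			Path:      path,
			Content:   content[start:end],
			StartByte: start,
		})
	}
	return chunks
}

func main() {
	src := "package a\n\nfunc F() {}\n"
	for _, c := range chunkBySize("a.go", src, 10) {
		fmt.Printf("%s@%d: %q\n", c.Path, c.StartByte, c.Content)
	}
}
```

With a 10-byte limit the function declaration in this input is split across two chunks, which is the kind of boundary a syntax-aware chunker would be expected to avoid.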
Current Process Flow
- gitlab-elasticsearch-indexer has a chunk mode indexer implemented in internal/mode/chunk/chunk.go
- in internal/mode/chunk/chunk.go, we create the SizeChunker and pass it to the ChunkIndexer
- we start indexing by calling ChunkIndexer.PerformIndexing
- ChunkIndexer.PerformIndexing fetches the files from gitaly, and passes them to a BulkDelete and BulkIndex function
- BulkIndex uses the current chunker (a SizeChunker) to chunk the files
- BulkIndex then passes the chunks to the vectorStoreIndexer to be indexed in the current vector store (elasticsearch)
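The flow above can be sketched roughly as follows. Only the names ChunkIndexer, PerformIndexing, BulkDelete and BulkIndex come from this issue; every signature, the Chunker and VectorStoreIndexer interfaces, the Fetch callback (standing in for the Gitaly fetch) and the in-memory store are assumptions made so the example runs on its own. In particular, routing deleted paths to BulkDelete and changed files to BulkIndex is a guess at how the two functions divide the work.

```go
package main

import "fmt"

// All types and signatures below are hypothetical stand-ins.

type File struct{ Path, Content string }

type Chunk struct{ Path, Content string }

// Chunker is the seam where the SizeChunker is plugged in today.
type Chunker interface {
	ChunkFile(path, content string) ([]Chunk, error)
}

// VectorStoreIndexer stands in for the component writing to the vector store.
type VectorStoreIndexer interface {
	Delete(paths []string) error
	Index(chunks []Chunk) error
}

type ChunkIndexer struct {
	Fetch   func() (changed []File, deleted []string, err error)
	Chunker Chunker
	Store   VectorStoreIndexer
}

// PerformIndexing fetches files, then delegates to BulkDelete and BulkIndex.
func (ci *ChunkIndexer) PerformIndexing() error {
	changed, deleted, err := ci.Fetch()
	if err != nil {
		return fmt.Errorf("fetch files: %w", err)
	}
	if err := ci.BulkDelete(deleted); err != nil {
		return err
	}
	return ci.BulkIndex(changed)
}

func (ci *ChunkIndexer) BulkDelete(paths []string) error {
	return ci.Store.Delete(paths)
}

// BulkIndex chunks every file with the configured Chunker and hands the
// resulting chunks to the vector store indexer.
func (ci *ChunkIndexer) BulkIndex(files []File) error {
	var chunks []Chunk
	for _, f := range files {
		fileChunks, err := ci.Chunker.ChunkFile(f.Path, f.Content)
		if err != nil {
			return fmt.Errorf("chunk %s: %w", f.Path, err)
		}
		chunks = append(chunks, fileChunks...)
	}
	return ci.Store.Index(chunks)
}

// wholeFileChunker is a trivial Chunker used only to exercise the flow.
type wholeFileChunker struct{}

func (wholeFileChunker) ChunkFile(path, content string) ([]Chunk, error) {
	return []Chunk{{Path: path, Content: content}}, nil
}

// memoryStore records what it receives instead of calling Elasticsearch.
type memoryStore struct {
	Deleted []string
	Indexed []Chunk
}

func (m *memoryStore) Delete(paths []string) error {
	m.Deleted = append(m.Deleted, paths...)
	return nil
}

func (m *memoryStore) Index(chunks []Chunk) error {
	m.Indexed = append(m.Indexed, chunks...)
	return nil
}

func main() {
	store := &memoryStore{}
	ci := &ChunkIndexer{
		Fetch: func() ([]File, []string, error) {
			return []File{{Path: "a.go", Content: "package a"}}, []string{"old.go"}, nil
		},
		Chunker: wholeFileChunker{},
		Store:   store,
	}
	if err := ci.PerformIndexing(); err != nil {
		fmt.Println("indexing failed:", err)
		return
	}
	fmt.Println("deleted:", store.Deleted, "indexed chunks:", len(store.Indexed))
}
```

Because BulkIndex only depends on the Chunker interface in this sketch, swapping one chunker for another is a change at the construction site rather than inside the indexing loop.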
Prerequisites
- the Go-based gitlab-elasticsearch-indexer should introduce a "chunking" mode, see #536209 (closed)
- the gitlab-code-parser should expose its Code Chunker as a Go package, see gitlab-org/rust/gitlab-code-parser!85 (merged)
References
Running the gitlab-elasticsearch-indexer
Planning discussions
- Code Parsing and Chunking Strategy proposal
- Code Embeddings blueprint
- Please refer to the outcome in #536142 (closed)
- Please refer to the Indexer & Chunker contract defined in #528770 (comment 2451039319)
Proposal
- Introduce the gitlab-code-parser's Code Chunker into the gitlab-elasticsearch-indexer. This should be wrapped as a sub-package in the chunker package, similar to the SizeChunker
- In step 2 of the process flow defined above, replace the creation of the Size Chunker with creating the new Code Chunker
- Ensure that Code Chunker outputs all the same Chunk fields as the SizeChunker, defined in the parent chunker package
- Ensure that the output of the Code Chunker can be used in the call to the vectorStoreIndexer
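One possible shape for the wrapper is an adapter that converts the parser's output into the shared Chunk type. The sketch below is written under heavy assumptions: the Chunk fields, the Chunker interface, and especially the codeParser interface and parsedChunk type are invented for illustration, since the real API of the gitlab-code-parser Go package is not described in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// Chunk and Chunker mimic what the parent chunker package might define.
// The field set is an assumption, not the real struct.
type Chunk struct {
	Path      string
	Content   string
	StartByte int
	Length    int
}

type Chunker interface {
	ChunkFile(path, content string) ([]Chunk, error)
}

// parsedChunk and codeParser are hypothetical stand-ins for whatever the
// gitlab-code-parser Go package returns and exposes.
type parsedChunk struct {
	Text       string
	Start, End int
}

type codeParser interface {
	Parse(path, content string) ([]parsedChunk, error)
}

// CodeChunker adapts a codeParser to the Chunker interface.
type CodeChunker struct {
	parser codeParser
}

func NewCodeChunker(p codeParser) (*CodeChunker, error) {
	if p == nil {
		return nil, errors.New("code chunker: nil parser")
	}
	return &CodeChunker{parser: p}, nil
}

// ChunkFile fills every Chunk field so downstream consumers see the same
// shape regardless of which chunker produced the chunk.
func (c *CodeChunker) ChunkFile(path, content string) ([]Chunk, error) {
	parsed, err := c.parser.Parse(path, content)
	if err != nil {
		return nil, fmt.Errorf("parse %s: %w", path, err)
	}
	chunks := make([]Chunk, 0, len(parsed))
	for _, p := range parsed {
		chunks = append(chunks, Chunk{
			Path:      path,
			Content:   p.Text,
			StartByte: p.Start,
			Length:    p.End - p.Start,
		})
	}
	return chunks, nil
}

// Compile-time check that CodeChunker satisfies Chunker.
var _ Chunker = (*CodeChunker)(nil)

// fakeParser returns the whole file as a single chunk.
type fakeParser struct{}

func (fakeParser) Parse(_, content string) ([]parsedChunk, error) {
	return []parsedChunk{{Text: content, Start: 0, End: len(content)}}, nil
}

func main() {
	cc, err := NewCodeChunker(fakeParser{})
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	chunks, err := cc.ChunkFile("a.go", "package a")
	if err != nil {
		fmt.Println("chunking failed:", err)
		return
	}
	fmt.Printf("%+v\n", chunks)
}
```

The compile-time assertion is a cheap way to catch drift between the wrapper and the shared interface; a field-by-field comparison against SizeChunker output would still need a test of its own.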