[Go Indexer] Introduce the Code Parsing Chunker

Context

(For an overview of the Code Parsing and Chunking Strategy, please refer to #528770 (comment 2451039319).)

Currently, the gitlab-elasticsearch-indexer uses a SizeChunker, which splits code content into chunks by size. In this issue, we will replace it with the Code Chunker from the gitlab-code-parser, which exposes a Go package that can be used in the gitlab-elasticsearch-indexer.

Current Process Flow

  1. gitlab-elasticsearch-indexer has a chunk mode indexer implemented in internal/mode/chunk/chunk.go.
  2. In internal/mode/chunk/chunk.go, we create the SizeChunker and pass it to the ChunkIndexer.
  3. We start indexing by calling ChunkIndexer.PerformIndexing.
  4. ChunkIndexer.PerformIndexing fetches the files from Gitaly and passes them to the BulkDelete and BulkIndex functions.
  5. BulkIndex uses the current chunker (a SizeChunker) to chunk the files.
  6. BulkIndex then passes the chunks to the vectorStoreIndexer to be indexed in the current vector store (Elasticsearch).
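
To make steps 5 and 6 concrete, here is a minimal, self-contained sketch of a size-based chunker feeding a vector store. It is not the indexer's actual code: the Chunk fields, the Chunker and VectorStoreIndexer interfaces, the File type, the memoryStore, and the bulkIndex signature are all assumptions made for illustration, and the real types in gitlab-elasticsearch-indexer may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// Chunk is a hypothetical stand-in for the chunk type in the chunker package.
type Chunk struct {
	Path      string
	Content   string
	StartByte int
}

// Chunker is an assumed interface; the real one may differ.
type Chunker interface {
	Chunk(path string, content []byte) ([]Chunk, error)
}

// SizeChunker splits content into pieces of at most MaxBytes bytes.
// Splitting on raw bytes can cut a multi-byte UTF-8 character in half;
// this sketch ignores that.
type SizeChunker struct{ MaxBytes int }

func (s SizeChunker) Chunk(path string, content []byte) ([]Chunk, error) {
	if s.MaxBytes <= 0 {
		return nil, errors.New("MaxBytes must be positive")
	}
	var chunks []Chunk
	for start := 0; start < len(content); start += s.MaxBytes {
		end := start + s.MaxBytes
		if end > len(content) {
			end = len(content)
		}
		chunks = append(chunks, Chunk{Path: path, Content: string(content[start:end]), StartByte: start})
	}
	return chunks, nil
}

// VectorStoreIndexer is an assumed interface for the vector store.
type VectorStoreIndexer interface {
	Index(chunks []Chunk) error
}

// memoryStore records chunks in memory, standing in for Elasticsearch.
type memoryStore struct{ indexed []Chunk }

func (m *memoryStore) Index(chunks []Chunk) error {
	m.indexed = append(m.indexed, chunks...)
	return nil
}

// File is a hypothetical stand-in for a file fetched from Gitaly.
type File struct {
	Path    string
	Content []byte
}

// bulkIndex chunks every file (step 5) and hands the chunks to the store (step 6).
func bulkIndex(files []File, c Chunker, store VectorStoreIndexer) error {
	for _, f := range files {
		chunks, err := c.Chunk(f.Path, f.Content)
		if err != nil {
			return fmt.Errorf("chunking %s: %w", f.Path, err)
		}
		if err := store.Index(chunks); err != nil {
			return fmt.Errorf("indexing %s: %w", f.Path, err)
		}
	}
	return nil
}

func main() {
	store := &memoryStore{}
	files := []File{{Path: "main.go", Content: []byte("package main\n")}}
	if err := bulkIndex(files, SizeChunker{MaxBytes: 8}, store); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range store.indexed {
		fmt.Printf("%s@%d: %q\n", c.Path, c.StartByte, c.Content)
	}
}
```

Because bulkIndex depends only on the Chunker interface in this sketch, swapping the chunker does not require changes to the indexing loop.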

Prerequisites

References

Running the gitlab-elasticsearch-indexer

See guide

Planning discussions

Proposal

  1. Introduce the gitlab-code-parser's Code Chunker into the gitlab-elasticsearch-indexer. It should be wrapped as a sub-package of the chunker package, similar to the SizeChunker.
  2. In step 2 of the process flow defined above, replace the creation of the SizeChunker with the creation of the new Code Chunker.
  3. Ensure that the Code Chunker outputs all the same Chunk fields as the SizeChunker, as defined in the parent chunker package.
  4. Ensure that the output of the Code Chunker can be used in the call to the vectorStoreIndexer.
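
One possible shape for the wrapper in points 1 and 3 is an adapter that converts the parser's output into the parent package's Chunk type. The sketch below is an illustration only. The gitlab-code-parser API is not described in this issue, so parsedChunk, parseFunc, and splitOnBlankLines are made-up stand-ins for it, and the Chunk fields and Chunker interface are assumed as well.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk mirrors a hypothetical Chunk type from the parent chunker package.
type Chunk struct {
	Path      string
	Content   string
	StartByte int
}

// Chunker is the assumed interface shared by all chunkers.
type Chunker interface {
	Chunk(path string, content []byte) ([]Chunk, error)
}

// parsedChunk is a made-up stand-in for whatever the code parser returns.
type parsedChunk struct {
	Text   string
	Offset int
}

// parseFunc stands in for the parser's chunking entry point (an assumption).
type parseFunc func(path string, src []byte) ([]parsedChunk, error)

// CodeChunker adapts a parseFunc to the Chunker interface.
type CodeChunker struct{ parse parseFunc }

// Compile-time check that CodeChunker satisfies Chunker.
var _ Chunker = CodeChunker{}

func (c CodeChunker) Chunk(path string, content []byte) ([]Chunk, error) {
	parsed, err := c.parse(path, content)
	if err != nil {
		return nil, fmt.Errorf("parsing %s: %w", path, err)
	}
	// Map every parser chunk onto the shared Chunk type so that downstream
	// consumers see the same fields regardless of which chunker produced them.
	chunks := make([]Chunk, 0, len(parsed))
	for _, p := range parsed {
		chunks = append(chunks, Chunk{Path: path, Content: p.Text, StartByte: p.Offset})
	}
	return chunks, nil
}

// splitOnBlankLines is a toy parser: one chunk per blank-line-separated block.
// A real code parser would split on syntactic boundaries instead.
func splitOnBlankLines(_ string, src []byte) ([]parsedChunk, error) {
	var out []parsedChunk
	offset := 0
	for _, block := range strings.Split(string(src), "\n\n") {
		if strings.TrimSpace(block) != "" {
			out = append(out, parsedChunk{Text: block, Offset: offset})
		}
		offset += len(block) + 2 // skip the block and its "\n\n" separator
	}
	return out, nil
}

func main() {
	var c Chunker = CodeChunker{parse: splitOnBlankLines}
	chunks, err := c.Chunk("a.go", []byte("func a() {}\n\nfunc b() {}"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ch := range chunks {
		fmt.Printf("%s@%d: %q\n", ch.Path, ch.StartByte, ch.Content)
	}
}
```

Keeping the conversion inside the adapter means the vectorStoreIndexer call would only ever see the shared Chunk type, which is what points 3 and 4 ask for.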