<!-- Design Documents often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->
<!-- This renders the design document header on the detail page, so don't remove it-->
{{<engineering/design-document-header>}}
## Overview
Ad-hoc indexing is a lazy-loading mechanism that automatically triggers initial indexing when a user attempts to perform semantic code search on a project that hasn't been indexed yet.
### Benefits
1. **Dramatically reduced storage**: Only active projects are indexed, reducing storage from 39-118 TB to manageable levels
1. **Cost efficiency**: Elasticsearch cluster can be significantly smaller
1. **Scalability**: Easier to manage growth incrementally rather than all at once
### Trade-off
1. **First-access latency**: The first semantic code search on a project will be slower as embeddings are generated
## Execution Flow
1. A user or AI agent attempts a semantic code search on an unindexed project. The request can come from various tools (MCP or `glab`) that eventually flow to the [Semantic Code Search REST API](semantic_code_search.md#semantic-code-search-on-the-rest-api).
1. The Semantic Code Search REST API invokes a search on the [`Ai::ActiveContext::Queries::Code` class](./semantic_code_search.md#using-the-activecontext-query).
1. If the project has not yet been indexed but is eligible for indexing, `Ai::ActiveContext::Queries::Code` triggers the `Ai::ActiveContext::Code::AdHocIndexingWorker`. This queues an ad-hoc indexing job that runs asynchronously.
1. `Ai::ActiveContext::Queries::Code` returns a message indicating: `initial indexing has been started, try again in a few minutes`.
1. The Semantic Code Search REST API returns the message to the invoking user or tool.
1. After a few minutes, once the initial indexing has completed, a repeated search on the project by the user or AI agent returns relevant results through the Semantic Code Search tool or REST API.
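
The flow above can be sketched with in-memory stand-ins. This is a minimal illustration, not the actual implementation: `FakeAdHocIndexingWorker`, `CodeQuerySketch`, the `Result` struct, and the "not eligible" message are hypothetical names introduced here. Only the "indexing started" message comes from the steps above.

```ruby
require 'set'

# Hypothetical stand-in for Ai::ActiveContext::Code::AdHocIndexingWorker.
class FakeAdHocIndexingWorker
  @queue = []

  class << self
    attr_reader :queue

    # Mimics an asynchronous enqueue: record the job and return immediately.
    def perform_async(project_id)
      @queue << project_id
    end
  end
end

# Hypothetical stand-in for Ai::ActiveContext::Queries::Code.
class CodeQuerySketch
  INDEXING_STARTED = 'initial indexing has been started, try again in a few minutes'
  NOT_ELIGIBLE = 'project is not eligible for indexing' # assumed wording

  Result = Struct.new(:success, :message, :hits)

  def initialize(indexed_ids:, eligible_ids:)
    @indexed_ids = indexed_ids.to_set
    @eligible_ids = eligible_ids.to_set
  end

  def search(project_id:, query:)
    # Already indexed: run the search (faked here) and return hits.
    return Result.new(true, nil, ["hits for #{query}"]) if @indexed_ids.include?(project_id)

    # Not indexed and not eligible: nothing is queued.
    return Result.new(false, NOT_ELIGIBLE, []) unless @eligible_ids.include?(project_id)

    # Not indexed but eligible: queue ad-hoc indexing and ask the caller to retry.
    FakeAdHocIndexingWorker.perform_async(project_id)
    Result.new(false, INDEXING_STARTED, [])
  end
end

query = CodeQuerySketch.new(indexed_ids: [1], eligible_ids: [1, 2])
puts query.search(project_id: 2, query: 'token refresh').message
```

The sketch does not model deduplication of queued jobs or the transition from "queued" to "indexed"; a real implementation would need both so that repeated searches do not enqueue duplicate work.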
The "Codebase as Chat Context" functionality, which provides a way to semantically search through a codebase, was tightly coupled with Classic Duo Chat. We needed a way to make this available on Agentic Duo Chat.
## Decision
1. We decided to expose the semantic code search feature as a tool on the **GitLab MCP (Model Context Protocol) Server**.
Exposing Semantic Code Search through MCP decouples it from the Agentic Duo Chat implementation, allowing for independent evolution of both systems. In addition, other tools and agents can leverage the same MCP interface to access Semantic Code Search, promoting code reuse and consistency.
1. We decided to call the tool **Semantic Code Search**, and moving forward we will refer to this feature as such.
Using the "Semantic Code Search" name for the tool makes its functionality self-explanatory.
For further details, please refer to the [Semantic Code Search design document](../semantic_code_search.md).
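
To illustrate what calling such a tool through MCP might look like, the sketch below builds a JSON-RPC 2.0 `tools/call` request, which is the request shape the Model Context Protocol uses for tool invocation. The tool name `semantic_code_search` and the argument keys `project_id` and `semantic_query` are hypothetical placeholders, not the confirmed tool schema.

```ruby
require 'json'

# Build an MCP `tools/call` request (JSON-RPC 2.0 envelope).
# The tool name and argument keys passed in below are illustrative assumptions.
def build_tool_call(id:, tool_name:, arguments:)
  {
    jsonrpc: '2.0',
    id: id,
    method: 'tools/call',
    params: { name: tool_name, arguments: arguments }
  }
end

request = build_tool_call(
  id: 1,
  tool_name: 'semantic_code_search', # hypothetical tool name
  arguments: {
    project_id: 'gitlab-org/gitlab',                     # hypothetical argument
    semantic_query: 'where are OAuth tokens refreshed?'  # hypothetical argument
  }
)

puts JSON.pretty_generate(request)
```

Because any MCP client can send this kind of request, the same tool can serve Agentic Duo Chat and other agents without either depending on the other's internals.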
## Alternative considered: native Agentic Duo Chat tool
Implement Semantic Code Search as a native tool directly within the Agentic Duo Chat system, without exposing it through the MCP Server.
We decided against this approach because it would create unnecessary coupling and limit the extensibility of the system.
When planning to scale Semantic Code Search from the `gitlab-org` namespace to all eligible namespaces on GitLab.com, we encountered significant storage challenges with the Elasticsearch cluster we are using for the vector store.
An [analysis](https://gitlab.com/gitlab-org/gitlab/-/work_items/551852#note_2796177006) of the index for `gitlab-org/gitlab` estimated that the `gitlab-org` namespace would require more than 100 TB of storage.
A [storage distribution analysis](https://gitlab.com/gitlab-org/gitlab/-/work_items/562554#note_2709963740)
of `gitlab-org/gitlab` showed how much of the index storage each field consumes:
| Field | Percentage |
| --------- | ---------- |
| `_source` (raw documents) | 74.5% |
| `embeddings_v1` (vector index) | 22.0% |
| `content` (text index) | 1.3% |
| Other metadata | ~2% |
The storage requirement was deemed prohibitively expensive and operationally challenging.
Furthermore, the fields that dominate storage are ones we either need to keep or could only drop with a significant refactor and engineering effort.
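
To put the percentages in absolute terms, the sketch below applies them to a round total of 100 TB. This is illustrative arithmetic only: the percentages were measured on the `gitlab-org/gitlab` index, and applying them to a 100 TB total is an assumption made here, as is treating "~2%" as exactly 2.0.

```ruby
# Shares from the storage distribution table, in percent.
FIELD_SHARES = {
  '_source' => 74.5,
  'embeddings_v1' => 22.0,
  'content' => 1.3,
  'other metadata' => 2.0 # the table lists this as "~2%"
}.freeze

# Assumed total for illustration, based on the "more than 100 TB" estimate.
ASSUMED_TOTAL_TB = 100.0

# Convert percentage shares into terabytes for a given total.
def storage_by_field(total_tb, shares)
  shares.transform_values { |pct| (total_tb * pct / 100.0).round(2) }
end

storage_by_field(ASSUMED_TOTAL_TB, FIELD_SHARES).each do |field, tb|
  puts format('%-16s %6.2f TB', field, tb)
end
```

Under these assumptions, `_source` and `embeddings_v1` together account for 96.5% of the total, which is why they are the fields any storage optimization would have to target.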
## Decision
We decided to introduce **Ad-hoc Indexing** for Semantic Code Search.
This is a lazy-loading mechanism that automatically triggers initial indexing when a user attempts to perform semantic code search on a project that hasn't been indexed yet. Instead of a pipeline that indexes all eligible projects upfront, ad-hoc indexing reduces required storage by only initiating indexing when needed.
For further details, please see the [Ad-hoc Indexing design document](../ad_hoc_indexing.md).
## Alternatives Considered
1. **Remove the `content` field**
   - Stop storing the actual code snippet content in the index to reduce storage.
   - Status: rejected
   - Reasons:
     - The `content` field is essential for Semantic Code Search functionality
     - Removing it would require a major refactor of both the indexing pipeline and the Duo Chat integration
     - Analysis showed that removing `content` only saved ~4% of storage, making the effort not worthwhile
2. **Quantize embeddings**
   - Convert 4-byte float embeddings to 1-byte integers using quantization.
   - Status: deferred as a future optimization
   - Reasons for deferral:
     - This would require changes to the embedding model and vector search implementation
     - Potential impact on search quality and relevance
3. **Dynamic Partitions**
   - Implement dynamic partition allocation based on actual storage needs.
   - Status: deferred as a future optimization
   - Reason for deferral:
     - This would require significant engineering effort
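
For the "Quantize embeddings" alternative, the following sketch shows the general idea of converting 4-byte floats to 1-byte integers using symmetric scalar quantization. It is a generic illustration, not the approach that was evaluated: the `quantize` and `dequantize` helpers and the sample vector are hypothetical, and a production setup would more likely rely on quantization support in the vector store.

```ruby
# Symmetric scalar quantization: map floats in [-max_abs, max_abs]
# onto signed 8-bit integers in [-127, 127].
def quantize(vector)
  max_abs = vector.map(&:abs).max
  return [Array.new(vector.size, 0), 1.0] if max_abs.nil? || max_abs.zero?

  scale = max_abs / 127.0
  [vector.map { |v| (v / scale).round.clamp(-127, 127) }, scale]
end

# Recover approximate floats; the rounding error is the quality cost.
def dequantize(quantized, scale)
  quantized.map { |q| q * scale }
end

embedding = [0.12, -0.98, 0.45, 0.0] # made-up sample values
quantized, scale = quantize(embedding)

float_bytes = embedding.pack('e*').bytesize # 32-bit floats: 4 bytes per value
int8_bytes = quantized.pack('c*').bytesize  # signed 8-bit ints: 1 byte per value

puts "float32: #{float_bytes} bytes, int8: #{int8_bytes} bytes"
puts "max round-trip error: #{embedding.zip(dequantize(quantized, scale)).map { |a, b| (a - b).abs }.max}"
```

The 4x size reduction applies only to the vector data, and the round-trip error is the source of the potential impact on search quality and relevance noted above.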