Add Semantic Code Search design docs (e8bd4a2e) · Commits · GitLab.com / Content Sites / handbook

content/handbook/engineering/architecture/design-documents/codebase_as_chat_context/_index.md

+1 −1

Original line number	Diff line number	Diff line
		@@ -3,7 +3,7 @@
		# good title can help communicate what the design document is and should be considered
		# as part of any review.
		title: Codebase as Chat Context
		status: proposed
		status: implemented
		creation-date: "2025-04-02"
		authors: [ "@partiaga", "@tgao3701908" ]
		coaches: [ "@jessieay", "@dgruzd" ]

content/handbook/engineering/architecture/design-documents/codebase_as_chat_context/ad_hoc_indexing.md

0 → 100644

+111 −0

		---
		title: "Semantic Code Search Ad-hoc Indexing"
		description: "Design document for Semantic Code Search Ad-hoc Indexing"
		status: implemented
		creation-date: "2026-06-29"
		authors: [ "@partiaga" ]
		coaches: []
		dris: [ "@wortschi" ]
		owning-stage: "~devops::ai platform"
		toc_hide: true
		---

		<!--

		The canonical place for the latest set of instructions (and the likely source
		of this file) is
		[content/handbook/engineering/architecture/design-documents/_template.md](https://gitlab.com/gitlab-com/content-sites/handbook/-/blob/main/content/handbook/engineering/architecture/design-documents/_template.md).

		Document statuses you can use:

		- "proposed"
		- "accepted"
		- "ongoing"
		- "implemented"
		- "postponed"
		- "rejected"

		-->

		<!-- Design Documents often contain forward-looking statements -->
		<!-- vale gitlab.FutureTense = NO -->

		<!-- This renders the design document header on the detail page, so don't remove it-->
		{{< engineering/design-document-header >}}

		## Overview

		Ad-hoc indexing is a lazy-loading mechanism that automatically triggers initial indexing when a user attempts to perform semantic code search on a project that hasn't been indexed yet.

		### Benefits

		1. Dramatically reduced storage: Only active projects are indexed, reducing storage from 39-118 TB to manageable levels
		1. Cost efficiency: Elasticsearch cluster can be significantly smaller
		1. Scalability: Easier to manage growth incrementally rather than all at once

		### Trade-off

		1. First-access latency: The first semantic code search on a project will be slower as embeddings are generated

		## Execution Flow

		1. A user or AI agent attempts a semantic code search on an unindexed project. This can be done through various tools (MCP or `glab`), which eventually route the request to the [Semantic Code Search REST API](semantic_code_search.md#semantic-code-search-on-the-rest-api).
		1. The Semantic Code Search REST API invokes a search on the [`Ai::ActiveContext::Queries::Code` class](./semantic_code_search.md#using-the-activecontext-query).
		1. If the project has not yet been indexed, is eligible for indexing, and ad-hoc indexing is not rate limited, `Ai::ActiveContext::Queries::Code` triggers the `Ai::ActiveContext::Code::AdHocIndexingWorker`. This queues an ad-hoc indexing job that runs asynchronously.
		1. `Ai::ActiveContext::Queries::Code` returns a message indicating: `initial indexing has been started, try again in a few minutes`
		1. The Semantic Code Search REST API returns the message to the invoking user or tool.
		1. In a few minutes, when the user or AI agent performs a search on the project, the Semantic Code Search tool or REST API will return relevant search results.

		### `Ai::ActiveContext::Code::AdHocIndexingWorker`

		On perform, `AdHocIndexingWorker` calls `RepositoryIndexWorker.perform_async`. From there, initial indexing will start for the given project.

		For further details on initial indexing and related state management, please see [Index state management](code_embeddings.md#index-state-management).

		### Rate limits

		Ad-hoc indexing is rate-limited per namespace to prevent a single namespace from overwhelming the system with indexing requests.

		- Key: `semantic_code_search_ad_hoc_indexing`
		- Scope: Root namespace (shared across all projects in the namespace)
		- Limit: 10 requests per hour. This is a low rate limit because initial indexing only needs to be done once per project.
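
The limit above can be pictured as a counter keyed by root namespace. The following is a minimal sketch, assuming a fixed one-hour window held in memory; the class name `NamespaceRateLimiterSketch`, its method, and the injectable clock are hypothetical, and the storage and windowing strategy of the actual limiter are not described in this document.

```ruby
# Hypothetical in-memory fixed-window limiter (illustration only).
# The real limiter's backing store and window behavior are not specified here.
class NamespaceRateLimiterSketch
  def initialize(limit: 10, interval_seconds: 3600, clock: -> { Time.now.to_i })
    @limit = limit
    @interval_seconds = interval_seconds
    @clock = clock
    # Keyed by [root_namespace_id, window_index]; the value is the request count.
    @counters = Hash.new(0)
  end

  # Records one indexing request for the root namespace and returns true
  # when that request exceeds the limit for the current window.
  def throttled?(root_namespace_id)
    window_index = @clock.call / @interval_seconds
    key = [root_namespace_id, window_index]
    @counters[key] += 1
    @counters[key] > @limit
  end
end
```

Because the key is the root namespace rather than the project, every project under the same namespace draws from the same budget of 10 requests in this sketch.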

		### Sequence Diagram

		```mermaid
		sequenceDiagram
		participant Client@{ "type" : "entity" } as User or MCP tool or glab CLI

		box Rails
		participant API as REST API
		participant ACQuery as Ai::ActiveContext::<br />Queries::Code
		participant ACRepository as Ai::ActiveContext::<br />Code::Repository
		participant ACAdhocIndexingWorker as Ai::ActiveContext::Code::<br />AdHocIndexingWorker
		end

		participant VectorStorage@{ "type" : "database" } as Vector Storage

		Client->>API: performs an API request
		API->>ACQuery: calls 'filter'
		ACQuery->>ACRepository: finds a 'ready' record for the project<br />(this indicates whether a project<br /> has been indexed or not)
		alt record exists for project
		ACRepository->>ACQuery: returns a 'ready' record
		ACQuery->>VectorStorage: performs query
		VectorStorage->>ACQuery: returns result
		ACQuery->>API: returns result
		API->>Client: returns result
		else
		ACRepository->>ACQuery: returns no 'ready' record
		alt project is not eligible for indexing
		ACQuery->>API: returns project is not eligible for indexing
		API->>Client: returns project is not eligible for indexing
		else ad-hoc indexing runs into rate limits
		ACQuery->>API: returns indexing is rate limited
		API->>Client: returns indexing is rate limited
		else project is eligible and ad-hoc indexing is not rate limited
		ACQuery->>ACAdhocIndexingWorker: triggers asynchronous execution
		ACQuery->>API: returns initial indexing has started
		API->>Client: returns initial indexing has started
		end
		end
		```
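
The branches in the diagram above can be summarized in plain Ruby. This is a framework-free sketch under stated assumptions, not the actual implementation: `CodeQuerySketch`, its `Result` struct, the status symbols, and the injected callables are all hypothetical stand-ins for `Ai::ActiveContext::Queries::Code`, the repository lookup, the rate limiter, and the `AdHocIndexingWorker` enqueue call.

```ruby
require 'set'

# Hypothetical sketch of the decision flow; none of these names or
# signatures are taken from the GitLab codebase.
class CodeQuerySketch
  Result = Struct.new(:status, :payload)

  def initialize(ready_projects:, eligible_projects:, rate_limited:, enqueue:, search:)
    @ready_projects = ready_projects       # Set of project IDs that have a 'ready' record
    @eligible_projects = eligible_projects # Set of project IDs eligible for indexing
    @rate_limited = rate_limited           # callable: project_id -> true when throttled
    @enqueue = enqueue                     # callable: project_id -> queues async indexing
    @search = search                       # callable: (project_id, query) -> search results
  end

  def filter(project_id:, query:)
    # Indexed project: query the vector storage and return the results.
    return Result.new(:ok, @search.call(project_id, query)) if @ready_projects.include?(project_id)

    # No 'ready' record: check eligibility first, then the rate limit.
    unless @eligible_projects.include?(project_id)
      return Result.new(:not_eligible, 'project is not eligible for indexing')
    end
    return Result.new(:rate_limited, 'indexing is rate limited') if @rate_limited.call(project_id)

    # Eligible and not throttled: queue indexing and ask the caller to retry.
    @enqueue.call(project_id)
    Result.new(:indexing_started, 'initial indexing has been started, try again in a few minutes')
  end
end
```

Passing the collaborators in as callables keeps the sketch runnable without Rails or a job queue; in the flow described above, the enqueue step corresponds to triggering `AdHocIndexingWorker`, which in turn calls `RepositoryIndexWorker.perform_async`.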

content/handbook/engineering/architecture/design-documents/codebase_as_chat_context/code_embeddings.md

+1 −1

		---
		title: "Code Embeddings"
		status: ongoing
		status: implemented
		creation-date: "2025-04-02"
		authors: [ "@maddievn" ]
		coaches: ["@dgruzd", "@DylanGriffith"]

content/handbook/engineering/architecture/design-documents/codebase_as_chat_context/decisions/001_semantic_code_search.md

0 → 100644

+31 −0

		---
		title: "ADR-001: Semantic Code Search"
		description: "Decision record for introducing a Semantic Code Search tool for Agentic Duo Chat"
		toc_hide: true
		---

		## Context

		The "Codebase as Chat Context" functionality, which provides a way to semantically search through a codebase, was tightly coupled with Classic Duo Chat. We needed a way to make this available on Agentic Duo Chat.

		## Decision

		1. We decided to expose the semantic code searching feature as a tool on the GitLab MCP (Model Context Protocol) Server.

		Exposing Semantic Code Search through MCP decouples it from the Agentic Duo Chat implementation, allowing for independent evolution of both systems. In addition, other tools and agents can leverage the same MCP interface to access Semantic Code Search, promoting code reuse and consistency.

		1. We decided to call the tool Semantic Code Search, and moving forward we will refer to this feature as such.

		Using the "Semantic Code Search" name for the tool makes its functionality self-explanatory.

		For further details, please refer to the [Semantic Code Search design document](../semantic_code_search.md).

		## Alternative considered: native Agentic Duo Chat tool

		Implement Semantic Code Search as a native tool directly within the Agentic Duo Chat system, without exposing it through the MCP Server.

		We decided against this approach because it would create unnecessary coupling and limit the extensibility of the system.

		## Related Work Item

		- [Semantic Code Search - Agentic Chat Integration](https://gitlab.com/groups/gitlab-org/-/work_items/18193)

content/handbook/engineering/architecture/design-documents/codebase_as_chat_context/decisions/002_reduce_storage_with_adhoc_indexing.md

0 → 100644

+59 −0

		---
		title: "ADR-002: Reduce Storage Size for Semantic Code Search Clusters with Ad-hoc Indexing"
		description: "Decision record for reducing vector storage requirements for Semantic Code Search"
		toc_hide: true
		---

		## Context

		When planning to scale Semantic Code Search from the `gitlab-org` namespace to all eligible namespaces on GitLab.com, we encountered significant storage challenges with the Elasticsearch cluster we are using for the vector store.

		[Analysis](https://gitlab.com/gitlab-org/gitlab/-/work_items/551852#note_2796177006) of the index for `gitlab-org/gitlab` indicated that `gitlab-org` is estimated to require more than 100 TB of storage.

		A [storage distribution analysis](https://gitlab.com/gitlab-org/gitlab/-/work_items/562554#note_2709963740)
		of `gitlab-org/gitlab` showed the share of index storage used by each field:

		| Field | Percentage |
		| --------- | ---------- |
		| `_source` (raw documents) | 74.5% |
		| `embeddings_v1` (vector index) | 22.0% |
		| `content` (text index) | 1.3% |
		| Other metadata | ~2% |

		The storage requirement was deemed prohibitively expensive and operationally challenging.
		Furthermore, the fields we would need to target for optimization are ones we either need to keep or could only drop with a significant refactor and engineering effort.
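
For a sense of scale, the field percentages above can be converted into absolute sizes. The sketch below assumes, purely for illustration, a 100 TB index and that the shares measured for `gitlab-org/gitlab` hold across the whole index; `FIELD_SHARES_PERCENT` and `storage_by_field` are hypothetical names, and "other metadata" uses the approximate 2% figure.

```ruby
# Shares of index storage per field, as percentages from the table above.
# The last entry is the approximate "~2%" value.
FIELD_SHARES_PERCENT = {
  '_source' => 74.5,
  'embeddings_v1' => 22.0,
  'content' => 1.3,
  'other metadata' => 2.0
}.freeze

# Splits a total index size (in TB) across fields, rounded to one decimal place.
def storage_by_field(total_tb, shares = FIELD_SHARES_PERCENT)
  shares.transform_values { |percent| (total_tb * percent / 100.0).round(1) }
end

# Illustrative total only; the analysis states "more than 100 TB".
storage_by_field(100).each do |field, terabytes|
  puts format('%-16s %6.1f TB', field, terabytes)
end
```

Under these assumptions, roughly three quarters of the space would go to `_source` and about a fifth to the vector index, which is why both were the natural optimization targets.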

		## Decision

		We decided to introduce Ad-hoc Indexing for Semantic Code Search.

		This is a lazy-loading mechanism that automatically triggers initial indexing when a user attempts to perform semantic code search on a project that hasn't been indexed yet. Instead of a pipeline that indexes all eligible projects upfront, ad-hoc indexing reduces required storage by only initiating indexing when needed.

		For further details, please see the [Ad-hoc Indexing design document](../ad_hoc_indexing.md).

		## Alternatives Considered

		1. Remove the `content` field
		   - Stop storing the actual code snippet content in the index to reduce storage.
		   - Status: rejected
		   - Reasons:
		     - The `content` field is essential for Semantic Code Search functionality
		     - Removing it would require a major refactor of both the indexing pipeline and the Duo Chat integration
		     - Analysis showed that removing `content` only saved ~4% of storage, making the effort not worthwhile
		2. Quantize embeddings
		   - Convert 4-byte float embeddings to 1-byte integers using quantization.
		   - Status: deferred as a future optimization
		   - Reasons for deferral:
		     - This would require changes to the embedding model and vector search implementation
		     - Potential impact on search quality and relevance
		3. Dynamic Partitions
		   - Implement dynamic partition allocation based on actual storage needs.
		   - Status: deferred as a future optimization
		   - Reason for deferral:
		     - This would require significant engineering effort
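
To illustrate what the quantization alternative would involve, here is a minimal sketch of symmetric scalar quantization from floats to signed 8-bit integers. It is one common quantization scheme, chosen here as an assumption; the ADR does not say which scheme was considered, and `QuantizedVector`, `quantize_int8`, and `dequantize` are hypothetical names.

```ruby
# Hypothetical container: 8-bit integer codes plus the scale needed to
# map them back to approximate float values.
QuantizedVector = Struct.new(:codes, :scale)

# Maps each float to an integer in -127..127, scaled by the largest
# absolute value in the vector (symmetric scalar quantization).
def quantize_int8(vector)
  max_abs = vector.map(&:abs).max || 0.0
  return QuantizedVector.new(Array.new(vector.size, 0), 1.0) if max_abs.zero?

  codes = vector.map { |value| (value * 127.0 / max_abs).round.clamp(-127, 127) }
  QuantizedVector.new(codes, max_abs / 127.0)
end

# Reconstructs approximate floats; the rounding error is the quality
# trade-off noted in the reasons for deferral.
def dequantize(quantized)
  quantized.codes.map { |code| code * quantized.scale }
end

embedding = [1.0, -1.0, 0.0, 0.3]
quantized = quantize_int8(embedding)

# 'e*' packs 4-byte little-endian floats; 'c*' packs signed 8-bit integers.
float_bytes = embedding.pack('e*').bytesize
int8_bytes = quantized.codes.pack('c*').bytesize
puts "float32: #{float_bytes} bytes, int8: #{int8_bytes} bytes"
```

The packed sizes show the 4-to-1 reduction per vector, while `dequantize` makes the loss of precision visible, which is where the potential impact on search relevance comes from.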

		## Related Work Items

		1. [Cluster sizing](https://gitlab.com/gitlab-org/gitlab/-/issues/551852)
		1. [Determine number_of_partitions](https://gitlab.com/gitlab-org/gitlab/-/work_items/562554)
		1. [Ad-hoc indexing epic](https://gitlab.com/groups/gitlab-org/-/epics/19655)