Commit e8bd4a2e authored by Pam Artiaga, committed by Arturo Herrero

Add Semantic Code Search design docs

parent 6d6605ee
+1 −1
@@ -3,7 +3,7 @@
# good title can help communicate what the design document is and should be considered
# as part of any review.
title: Codebase as Chat Context
-status: proposed
+status: implemented
creation-date: "2025-04-02"
authors: [ "@partiaga", "@tgao3701908" ]
coaches: [ "@jessieay", "@dgruzd" ]
+111 −0
---
title: "Semantic Code Search Ad-hoc Indexing"
description: "Design document for Semantic Code Search Ad-hoc Indexing"
status: implemented
creation-date: "2026-06-29"
authors: [ "@partiaga" ]
coaches: []
dris: [ "@wortschi" ]
owning-stage: "~devops::ai platform"
toc_hide: true
---

<!--

The canonical place for the latest set of instructions (and the likely source
of this file) is
[content/handbook/engineering/architecture/design-documents/_template.md](https://gitlab.com/gitlab-com/content-sites/handbook/-/blob/main/content/handbook/engineering/architecture/design-documents/_template.md).

Document statuses you can use:

- "proposed"
- "accepted"
- "ongoing"
- "implemented"
- "postponed"
- "rejected"

-->

<!-- Design Documents often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->

<!-- This renders the design document header on the detail page, so don't remove it-->
{{< engineering/design-document-header >}}

## Overview

Ad-hoc indexing is a lazy-loading mechanism that automatically triggers initial indexing when a user attempts to perform semantic code search on a project that hasn't been indexed yet.

### Benefits

1. **Dramatically reduced storage**: Only active projects are indexed, reducing storage from 39-118 TB to manageable levels
1. **Cost efficiency**: Elasticsearch cluster can be significantly smaller
1. **Scalability**: Easier to manage growth incrementally rather than all at once

### Trade-off

1. **First-access latency**: The first semantic code search on a project will be slower as embeddings are generated

## Execution Flow

1. A user or AI agent attempts a semantic code search on an unindexed project. This can be done through various tools (MCP or `glab`), all of which eventually call the [Semantic Code Search REST API](semantic_code_search.md#semantic-code-search-on-the-rest-api).
1. The Semantic Code Search REST API invokes a search on the [`Ai::ActiveContext::Queries::Code` class](./semantic_code_search.md#using-the-activecontext-query).
1. If the project has not yet been indexed but it's eligible for indexing, `Ai::ActiveContext::Queries::Code` triggers the `Ai::ActiveContext::Code::AdHocIndexingWorker`. This queues an ad-hoc indexing job that will run asynchronously.
1. `Ai::ActiveContext::Queries::Code` returns a message indicating: `initial indexing has been started, try again in a few minutes`
1. The Semantic Code Search REST API returns the message to the invoking user or tool.
1. In a few minutes, when the user or AI agent performs a search on the project, the Semantic Code Search tool or REST API will return relevant search results.
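
The flow above can be sketched in plain Ruby. This is a minimal illustration, not the actual implementation: `Project`, `SearchResponse`, `FakeAdHocIndexingWorker`, and `CodeQuerySketch` are hypothetical stand-ins for the Rails classes, the job queue is an in-memory array, and the wording of the "not eligible" message is assumed.

```ruby
# Hypothetical stand-ins: the real classes live in the Rails app and are not
# reproduced here. Field names and the in-memory queue are assumptions.
Project = Struct.new(:id, :indexed, :eligible, keyword_init: true)
SearchResponse = Struct.new(:status, :body, keyword_init: true)

# Stands in for Ai::ActiveContext::Code::AdHocIndexingWorker; it only records
# the enqueued project IDs instead of scheduling a background job.
class FakeAdHocIndexingWorker
  @jobs = []

  class << self
    attr_reader :jobs

    def perform_async(project_id)
      @jobs << project_id
    end
  end
end

# Stands in for Ai::ActiveContext::Queries::Code.
class CodeQuerySketch
  INDEXING_STARTED = 'initial indexing has been started, try again in a few minutes'
  NOT_ELIGIBLE = 'project is not eligible for indexing' # wording is assumed

  def initialize(worker: FakeAdHocIndexingWorker)
    @worker = worker
  end

  def filter(project:, query:)
    if project.indexed
      return SearchResponse.new(status: :ok, body: run_search(project, query))
    end
    return SearchResponse.new(status: :not_eligible, body: NOT_ELIGIBLE) unless project.eligible

    # Steps 3 and 4: queue the asynchronous job, then answer immediately.
    @worker.perform_async(project.id)
    SearchResponse.new(status: :indexing_started, body: INDEXING_STARTED)
  end

  private

  # Placeholder for the vector store lookup.
  def run_search(project, query)
    ["placeholder hit for #{query.inspect} in project #{project.id}"]
  end
end

query = CodeQuerySketch.new
project = Project.new(id: 42, indexed: false, eligible: true)

first = query.filter(project: project, query: 'rate limiter')
puts first.body # the "initial indexing has been started" message

project.indexed = true # simulates the background job finishing
second = query.filter(project: project, query: 'rate limiter')
puts second.status
```

The first call enqueues a job and returns the retry message; once the project is marked as indexed, the same call returns search results instead.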

### `Ai::ActiveContext::Code::AdHocIndexingWorker`

On perform, `AdHocIndexingWorker` calls `RepositoryIndexWorker.perform_async`. From there, initial indexing will start for the given project.

For further details on initial indexing and related state management, please see [Index state management](code_embeddings.md#index-state-management).

### Rate limits

Ad-hoc indexing is rate-limited per namespace to prevent a single namespace from overwhelming the system with indexing requests.

- Key: `semantic_code_search_ad_hoc_indexing`
- Scope: Root namespace (shared across all projects in the namespace)
- Limit: 10 requests per hour. This is a low rate limit because initial indexing only needs to be done once per project.
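
To illustrate how a limit scoped to the root namespace behaves, here is a minimal in-memory fixed-window limiter. It is a sketch under stated assumptions, not GitLab's rate limiter: the class name, the `throttled?` method, the fixed-window strategy, and the injectable clock are all hypothetical. Only the key, the scope, and the limit of 10 per hour come from the list above.

```ruby
# Minimal in-memory fixed-window rate limiter. This is NOT GitLab's
# implementation; the class, its methods, and the windowing strategy are
# assumptions for illustration.
class AdHocIndexingRateLimiterSketch
  KEY = :semantic_code_search_ad_hoc_indexing
  LIMIT = 10        # requests per window, from the list above
  WINDOW = 60 * 60  # one hour, in seconds

  # The clock is injectable so the behaviour can be exercised without waiting.
  def initialize(clock: -> { Time.now.to_i })
    @clock = clock
    @counters = Hash.new(0)
  end

  # Counts one request for the root namespace and reports whether that
  # request exceeds the limit for the current window. Old windows are never
  # purged here; a real store would expire them.
  def throttled?(root_namespace_id)
    window_index = @clock.call / WINDOW
    bucket = [KEY, root_namespace_id, window_index]
    @counters[bucket] += 1
    @counters[bucket] > LIMIT
  end
end

now = 0
limiter = AdHocIndexingRateLimiterSketch.new(clock: -> { now })

# All projects in root namespace 1 share one counter.
results = Array.new(11) { limiter.throttled?(1) }
puts results.count(false) # requests allowed in the first hour
puts results.last         # whether the 11th request was throttled

now = 3600 # one hour later, a new window starts
puts limiter.throttled?(1)
```

Because the counter is keyed by root namespace rather than by project, indexing requests for different projects in the same namespace draw from the same budget.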

### Sequence Diagram

```mermaid
sequenceDiagram
  participant Client@{ "type" : "entity" } as User or MCP tool or glab CLI

  box Rails
    participant API as REST API
    participant ACQuery as Ai::ActiveContext::<br />Queries::Code
    participant ACRepository as Ai::ActiveContext::<br />Code::Repository
    participant ACAdhocIndexingWorker as Ai::ActiveContext::Code::<br />AdHocIndexingWorker
  end

  participant VectorStorage@{ "type" : "database" } as Vector Storage

  Client->>API: performs an API request
  API->>ACQuery: calls 'filter'
  ACQuery->>ACRepository: finds a 'ready' record for the project<br />(this indicates whether a project<br /> has been indexed or not)
  alt record exists for project
    ACRepository->>ACQuery: returns a 'ready' record
    ACQuery->>VectorStorage: performs query
    VectorStorage->>ACQuery: returns result
    ACQuery->>API: returns result
    API->>Client: returns result
  else
    ACRepository->>ACQuery: returns no 'ready' record
    alt project is not eligible for indexing
      ACQuery->>API: returns project is not eligible for indexing
      API->>Client: returns project is not eligible for indexing
    else ad-hoc indexing runs into rate limits
      ACQuery->>API: returns indexing is rate limited
      API->>Client: returns indexing is rate limited
    else project is eligible and ad-hoc indexing is not rate limited
      ACQuery->>ACAdhocIndexingWorker: triggers asynchronous execution
      ACQuery->>API: returns initial indexing has started
      API->>Client: returns initial indexing has started
    end
  end
```
+1 −1
---
title: "Code Embeddings"
-status: ongoing
+status: implemented
creation-date: "2025-04-02"
authors: [ "@maddievn" ]
coaches: ["@dgruzd", "@DylanGriffith"]
+31 −0
---
title: "ADR-001: Semantic Code Search"
description: "Decision record for introducing a Semantic Code Search tool for Agentic Duo Chat"
toc_hide: true
---

## Context

The "Codebase as Chat Context" functionality, which provides a way to semantically search through a codebase, was tightly coupled with Classic Duo Chat. We needed a way to make this available on Agentic Duo Chat.

## Decision

1. We decided to expose the semantic code searching feature as a tool on the **GitLab MCP (Model Context Protocol) Server**.

   Exposing Semantic Code Search through MCP decouples it from the Agentic Duo Chat implementation, allowing for independent evolution of both systems. In addition, other tools and agents can leverage the same MCP interface to access Semantic Code Search, promoting code reuse and consistency.

1. We decided to call the tool **Semantic Code Search**, and moving forward we will refer to this feature as such.

   Using the "Semantic Code Search" name for the tool makes its functionality self-explanatory.

For further details, please refer to the [Semantic Code Search design document](../semantic_code_search.md).

## Alternative considered: native Agentic Duo Chat tool

Implement Semantic Code Search as a native tool directly within the Agentic Duo Chat system, without exposing it through the MCP Server.

We decided against this approach because it would create unnecessary coupling and limit the extensibility of the system.

## Related Work Item

- [Semantic Code Search - Agentic Chat Integration](https://gitlab.com/groups/gitlab-org/-/work_items/18193)
+59 −0
---
title: "ADR-002: Reduce Storage Size for Semantic Code Search Clusters with Ad-hoc Indexing"
description: "Decision record for reducing vector storage requirements for Semantic Code Search"
toc_hide: true
---

## Context

When planning to scale Semantic Code Search from the `gitlab-org` namespace to all eligible namespaces on GitLab.com, we encountered significant storage challenges with the Elasticsearch cluster we are using for the vector store.

[Analysis](https://gitlab.com/gitlab-org/gitlab/-/work_items/551852#note_2796177006) of the index for `gitlab-org/gitlab` indicated that the `gitlab-org` namespace would require an estimated 100 TB or more of storage.

A [storage distribution analysis](https://gitlab.com/gitlab-org/gitlab/-/work_items/562554#note_2709963740)
of `gitlab-org/gitlab` showed how much of the index's storage each field uses:

| Field | Percentage |
| --------- | ---------- |
| `_source` (raw documents) | 74.5% |
| `embeddings_v1` (vector index) | 22.0% |
| `content` (text index) | 1.3% |
| Other metadata | ~2% |
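
As a rough worked example, the percentages in the table can be turned into absolute sizes. The 100 TB total is an assumption borrowed from the namespace-level estimate above; the table itself was measured on the `gitlab-org/gitlab` index, so the per-field figures are only indicative. The helper name `storage_by_field` is hypothetical.

```ruby
# Percentages from the table above; "other metadata" is only approximate.
FIELD_SHARE_PERCENT = {
  '_source' => 74.5,
  'embeddings_v1' => 22.0,
  'content' => 1.3,
  'other metadata' => 2.0
}.freeze

# Splits a total size (in TB) across fields according to the table.
# Assumption: the per-project distribution also holds for the whole cluster.
def storage_by_field(total_tb)
  FIELD_SHARE_PERCENT.transform_values { |pct| (total_tb * pct / 100.0).round(1) }
end

storage_by_field(100).each do |field, tb|
  puts format('%-16s %6.1f TB', field, tb)
end
```

Under these assumptions, `_source` and `embeddings_v1` together account for nearly all of the storage, which is why they are the fields any optimization would need to target.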

The storage requirement was deemed prohibitively expensive and operationally challenging.
Furthermore, the fields that dominate storage are ones we either need to keep or could only drop with significant refactoring and engineering effort.

## Decision

We decided to introduce **Ad-hoc Indexing** for Semantic Code Search.

This is a lazy-loading mechanism that automatically triggers initial indexing when a user attempts to perform semantic code search on a project that hasn't been indexed yet. Instead of a pipeline that indexes all eligible projects upfront, ad-hoc indexing reduces required storage by only initiating indexing when needed.

For further details, please see the [Ad-hoc Indexing design document](../ad_hoc_indexing.md).

## Alternatives Considered

1. **Remove the `content` field**
   - Stop storing the actual code snippet content in the index to reduce storage.
   - Status: rejected
   - Reasons:
      - The `content` field is essential for Semantic Code Search functionality
      - Removing it would require a major refactor of both the indexing pipeline and the Duo Chat integration
      - Analysis showed that removing `content` only saved ~4% of storage, making the effort not worthwhile
2. **Quantize embeddings**
   - Convert 4-byte float embeddings to 1-byte integers using quantization.
   - Status: deferred as a future optimization
   - Reasons for deferral:
      - This would require changes to the embedding model and vector search implementation
      - Potential impact on search quality and relevance
3. **Dynamic Partitions**
   - Implement dynamic partition allocation based on actual storage needs.
   - Status: deferred as a future optimization
   - Reason for deferral:
      - This would require significant engineering effort
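
To make the quantization alternative (item 2 above) concrete, here is a minimal sketch of scalar quantization from 4-byte floats to 1-byte signed integers. It is illustrative only: the value range of `[-1.0, 1.0]` is an assumption, the function names are hypothetical, and the vector store may quantize differently.

```ruby
# Scalar quantization sketch: maps each float to a signed 8-bit integer.
# The value range [-1.0, 1.0] is an assumption, not taken from the design
# document.
def quantize(vector, min: -1.0, max: 1.0)
  scale = 255.0 / (max - min)
  vector.map do |x|
    clamped = x.clamp(min, max)
    ((clamped - min) * scale).round - 128 # result is in -128..127
  end
end

# Approximate inverse: recovers floats with a small rounding error.
def dequantize(codes, min: -1.0, max: 1.0)
  step = (max - min) / 255.0
  codes.map { |q| (q + 128) * step + min }
end

embedding = [0.12, -0.53, 0.98, -1.0]
codes = quantize(embedding)
restored = dequantize(codes)

float_bytes = embedding.pack('e*').bytesize # 4 bytes per value as float32
int8_bytes = codes.pack('c*').bytesize      # 1 byte per value as int8

puts "float32: #{float_bytes} bytes, int8: #{int8_bytes} bytes"
puts "max round-trip error: #{embedding.zip(restored).map { |a, b| (a - b).abs }.max}"
```

The packed size shrinks by a factor of four, at the cost of a rounding error of up to half a quantization step per value. That loss of precision is the source of the search quality concern listed under the reasons for deferral.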

## Related Work Items

1. [Cluster sizing](https://gitlab.com/gitlab-org/gitlab/-/issues/551852)
1. [Determine number_of_partitions](https://gitlab.com/gitlab-org/gitlab/-/work_items/562554)
1. [Ad-hoc indexing epic](https://gitlab.com/groups/gitlab-org/-/epics/19655)