# Semantic search: chat with your codebase
## Executive Summary
The **Chat with your Codebase** project aims to bridge a critical functional gap in GitLab's Duo AI offering by enabling users to interact with their entire codebase through natural language queries. This capability will allow developers to more effectively understand, navigate, and plan changes to their repositories - a feature already offered by competing products.
The implementation leverages semantic search via code embeddings to retrieve relevant context from repositories, which is then processed by large language models to generate helpful responses. The system will support scoping queries to specific repositories, folders, or files, with awareness of branch-specific code changes to ensure relevance to the developer's current work.
The project will be executed through a collaborative _subteam_ effort between three specialized groups: Code Creation (embeddings and LLM integration), Editor Extensions (IDE interfaces), and Global Search (abstraction layer and retrieval systems).
## Background
Currently, we don't do a great job of helping customers understand their repository and code base. Duo users can select and ask questions about specific code blocks, and soon they will be able to ask questions of 1 or more files via `/include`. Competitors support a broader aperture - a user can ask questions about an entire repository, or scope the context to multiple folders, multiple files, and portions of code. This functional gap is commonly mentioned by customers, and here's a [recent summary](https://docs.google.com/presentation/d/1oyuqOCzR4wzWa6Llo-EwwHdsTxMetd17X9bPYf-YMHA/edit#slide=id.g32a4294fe40_0_77) of research in this space.
## **Main goals**
Help customers understand and navigate their code base, and plan changes.
**Target use case**
Natural language chat with a repository, to support the goals:
* Help users understand the functionality of a repository
* Help users navigate and plan changes to a repository
Proposing that we focus on chat for the initial MVC. Improving Code Suggestions can be a separate follow-on project.
**Example questions**
* What is the primary functionality of this repository?
* What are the key dependencies for the files in this directory?
* What are the most commonly used imports in the repository?
* Are there any unused methods in the project?
* Which functions are most frequently called in the repository?
* Which functions are not called anywhere?
* Examples specific to AI Gateway repo:
* Summarize the functionality of TestCodeGeneration and describe all the methods in the class.
* Where do we set the maximum context window use for code completion?
* Starting from the PROVIDER_PROMPT_CLASSES class, what is the flow and relationship between the back end classes when a user uses the Explain Code tool?
* If I modify the @generate function, what other files or functions are impacted?
* Where is the prompt for code generation?
## Assumptions
### Embeddings / semantic search
* The MVC will be supported by semantic search via the embeddings abstraction layer.
* The MVC won't have knowledge graph support.
* The feature/domain teams will determine how to parse/chunk the customer code base.
* [tree-sitter](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/blob/main/src/common/tree_sitter/languages.ts) could help parse the code snippets/chunks, which are then input to the embeddings model. We'll need to work through the exact implementation pattern here.
* The embeddings service will generate the embeddings, and store the output in a vector DB.
* The user query will be embedded via the embeddings service, matched against the vector representation of their code, and the match results are returned to be consumed by the LLM.
* The embeddings and vector store are hosted remotely - there is no local store of the data.
* Hybrid search will be supported in the first version of the abstraction layer.
* Uses both "standard" keyword search and semantic search, then ranks the combined results. This often provides better results than either method alone.
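The ranking step in hybrid search is commonly done with reciprocal rank fusion (RRF). The sketch below is illustrative only - the document set and the two ranked lists are made up, and this is not the abstraction layer's actual API:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists; each doc scores 1/(k + rank) per list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative ranked outputs from the two retrieval methods.
keyword_hits = ["auth.py", "routes.py", "models.py"]
semantic_hits = ["models.py", "auth.py", "schema.py"]

merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(merged[0])  # auth.py ranks first: strong in both lists
```

Documents that appear high in both lists outrank documents that dominate only one, which is why hybrid search often beats either method alone.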
### File/context exclusion
* Currently considering this as nice-to-have but not a prerequisite: allow customers to enforce their [security/privacy policy by controlling the content that is used within Duo.](https://gitlab.com/gitlab-org/gitlab/-/issues/517573)
* Workaround options:
* Disable Duo for the entire project, to ensure the file(s) aren't processed by the LLM.
* Use self-hosted models deployment to fully control data handling.
### Teams involved
* Chat
* Code Creation
* Editor Extensions
* Global Search
* AI Frameworks
## MVC Proposal
### User inputs
[_UX Design Reference_](https://gitlab.com/gitlab-org/gitlab/-/issues/523960)
* The user can submit natural language queries about their repository - see illustrative examples above.
* The user can scope their query to 1 or more repositories, 1 or more folders, or 1 or more files.
* The user can scope a combination of these; e.g. 1 folder and 2 files not included in that folder.
* The ability to scope to multiple repositories is primarily intended to support a microservice architecture, where a user may be planning across 2 or more repositories.
* For follow-up questions, the system persists the snippets and files that were returned in the previous response; i.e. it doesn't attempt to retrieve new snippets or files based on a follow-up question.
* The user can use `/reset` to remove all context scope, or create a new conversation thread with default scope.
* No change to the current default behavior when there is no selected scope.
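The scoping and persistence rules above can be sketched as a small state object. All names here are hypothetical illustrations, not the actual chat implementation:

```python
class ConversationContext:
    """Tracks user-selected scope and the snippets pinned after the
    first retrieval, per the MVC's follow-up behavior."""

    def __init__(self):
        self.scope = []            # repositories, folders, and/or files
        self.pinned_snippets = []  # results reused for follow-up questions

    def set_scope(self, *paths):
        self.scope = list(paths)

    def record_retrieval(self, snippets):
        # Persist the first response's snippets; follow-ups reuse them
        # rather than triggering a new search.
        if not self.pinned_snippets:
            self.pinned_snippets = list(snippets)

    def reset(self):
        """Equivalent of `/reset`: drop all context scope."""
        self.scope = []
        self.pinned_snippets = []

ctx = ConversationContext()
ctx.set_scope("src/auth/", "README.md")  # one folder plus a file outside it
ctx.record_retrieval(["src/auth/login.py"])
ctx.record_retrieval(["src/other.py"])   # follow-up: ignored, snippets persist
print(ctx.pinned_snippets)               # ['src/auth/login.py']
```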
### System responses & behavior
[_UX Design Reference_](https://gitlab.com/gitlab-org/gitlab/-/issues/523960)
* The Duo response indicates the primary files that were used to provide the response.
* The system doesn't have to use semantic search for every code-related question. We can continue inserting the entire file contents into the prompt/question, if the user includes a single file or multiple files that can be supported within the context window.
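The "skip semantic search when the files fit" decision might look roughly like this. The ~4 characters/token heuristic, the window size, and the reserve are illustrative assumptions, not production values:

```python
def fits_in_context(files, window_tokens=32_000, reserved_tokens=4_000):
    """Return True if all included file contents fit in the prompt
    directly, so retrieval can be skipped and full contents inserted.

    Uses a rough ~4 characters/token estimate (an assumption; the
    production system would use its model's actual tokenizer)."""
    estimated = sum(len(content) // 4 for content in files)
    return estimated <= window_tokens - reserved_tokens

small = ["def add(a, b):\n    return a + b\n"]
print(fits_in_context(small))   # True: inline the file directly
huge = ["x" * 1_000_000]
print(fits_in_context(huge))    # False: fall back to semantic search
```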
### Tier availability and deployment options
_Related issue: https://gitlab.com/gitlab-org/gitlab/-/issues/480506+_
**Duo tiers**
* Duo Core :white_check_mark:
* Duo Pro :white_check_mark:
* Duo Enterprise :white_check_mark:
**Trial support**
* Yes, supported for Duo trials :white_check_mark:
**Supported deployment options**
* .com :white_check_mark:
* Dedicated :white_check_mark:
* Self-Managed :white_check_mark:
* Self-Hosted: proposed as post-MVC iteration
### **Programming languages**
* Support each of the core languages [outlined here](https://docs.gitlab.com/user/project/repository/code_suggestions/#enhanced-suggestions).
* These make up the majority of customer adoption/use.
* Each of these (except PHP) is supported in [tree-sitter parsing](https://gitlab.com/gitlab-org/editor-extensions/gitlab-lsp/-/blob/main/src/common/tree_sitter/languages.ts).
### IDE extensions
* For MVC, this functionality is supported in:
* VS Code, Jetbrains, Visual Studio, Eclipse
* Neovim is currently Code Suggestions only, thus not supported
### Indexing & updating with code changes
_Note: please ensure these requirements are discussed and reviewed with the Global Search team. There are potentially significant impacts to cost and performance._
**User interaction**
* When the user is working from a feature branch, responses to user questions should use context from their feature branch, rather than from main.
* This ensures Duo can consider the most recent changes, which are likely to be relevant.
* This is applicable for a remote feature branch. This isn't applicable for a local feature branch, as we plan to support and manage local changes in a future iteration.
**System behavior**
* Embeddings context for eligible customers is indexed/updated when a commit is pushed to the main or default branch.
* An eligible customer is effectively a Duo customer who has agreed to the relevant data usage terms and has the necessary Duo entitlements.
* Embeddings context for eligible customers is indexed/updated when a commit is pushed to a remote feature branch.
* We do not need to index inactive repositories (those without recent commits/merges).
* For MVC, do not need to sync or manage local changes. This can be considered in a future iteration.
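The indexing rules above reduce to a small predicate. The entitlement check and the 90-day inactivity cutoff are illustrative assumptions, not decided values:

```python
from datetime import datetime, timedelta

def should_reindex(namespace_entitled, last_commit_at,
                   now=None, inactivity_cutoff_days=90):
    """Decide whether a pushed commit should trigger embedding
    (re)indexing. Pushes to the default branch or a remote feature
    branch qualify, but only for Duo-entitled customers, and inactive
    repositories are skipped. Local-only changes never reach this
    path, matching the MVC scope."""
    if not namespace_entitled:
        return False
    now = now or datetime.now()
    if now - last_commit_at > timedelta(days=inactivity_cutoff_days):
        return False  # inactive repository: no need to index
    return True

recent = datetime.now() - timedelta(days=1)
stale = datetime.now() - timedelta(days=400)
print(should_reindex(True, recent))   # True: entitled and active
print(should_reindex(True, stale))    # False: inactive repository
print(should_reindex(False, recent))  # False: not entitled
```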
### Latency targets
* p95 time to first token: ≤ 4 seconds
* Sourced from https://gitlab.com/groups/gitlab-org/-/epics/13866#target
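Measured time-to-first-token samples can be checked against this target with a nearest-rank percentile; the sample data below is made up:

```python
def p95(samples):
    """Nearest-rank 95th percentile of time-to-first-token samples
    (seconds): the ceil(0.95 * n)-th smallest value."""
    ordered = sorted(samples)
    index = -(-len(ordered) * 95 // 100) - 1  # ceiling division, 0-indexed
    return ordered[index]

ttft_seconds = [1.2, 0.9, 2.4, 3.1, 1.8, 2.0, 3.9, 1.1, 2.7, 1.5]
print(p95(ttft_seconds) <= 4.0)  # True: this sample meets the target
```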
### **Telemetry**
* Log the scope of the user query
* e.g. scoped to the entire repository
* e.g. scoped to 1 or more folders
* e.g. scoped to 1 or more files
* Log when the chat response included semantic search results
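Both telemetry points above could be captured in one structured event; the event and field names here are illustrative, not the actual schema:

```python
import json

def scope_event(scope_type, scope_count, used_semantic_search):
    """Build a structured telemetry event for a codebase-chat query.

    scope_type: e.g. "repository", "folders", or "files"
    (illustrative values)."""
    return json.dumps({
        "event": "codebase_chat_query",
        "scope_type": scope_type,
        "scope_count": scope_count,
        "semantic_search_used": used_semantic_search,
    })

print(scope_event("folders", 2, True))
```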
### **Feature flags**
* Proposing that we use `group` actor within the supporting feature flag.
* This would allow us to index and support specific repositories within enabled groups, and provide the feature set within those projects/repositories.
* We could alternatively use a `project` actor if we believe more granularity would be helpful.
* Proposing that we don't use a `user` actor, because we could then need to index and support all or most repositories while only a subset of users has access to the feature set.
### **Evaluations**
**Early development evaluations**
We can use an LLM judge to evaluate responses to a small set of questions.
_General questions_
* What is the primary functionality of this repository?
* What are the key dependencies for the files in this directory?
* What are the most commonly used imports in the repository?
* Are there any unused methods in the project?
* Which functions are most frequently called in the repository?
* Which functions are not called anywhere?
_Repository specific questions; e.g. AI Gateway_
* Summarize the functionality of TestCodeGeneration and describe all the methods in the class.
* Where do we set the maximum context window use for code completion?
* Starting from the Config class, what is the flow and relationship between the back end classes?
* If I modify the generate function, what other files or functions are impacted?
* Where is the prompt for code generation?
<details>
<summary>
<em>LLM judge instructions</em>
</summary>
```
You are an impartial evaluator tasked with assessing the answer of a code explanation tool. The input will be a natural language question about a specific code repository, along with a response from the tool. Follow these steps carefully:
1. Review the provided information
Input:
<question>
{question}
</question>
Response:
<response>
{response}
</response>
2. Evaluate the question and response based on these aspects
a. Is the response easy to understand?
b. Does the response accurately answer the question?
3. Provide your evaluation
a. Assign a score from 0 to 4 (0 = least effective, 4 = most effective).
b. Briefly justify your rating based on the criteria.
```
</details>
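Wiring the judge into an automated pipeline requires parsing its free-text verdict into a score. In this sketch, the abbreviated prompt template and `call_judge` are hypothetical stand-ins for the full instructions and the actual LLM call:

```python
import re

# Abbreviated version of the judge instructions above (illustrative).
JUDGE_PROMPT = """<question>
{question}
</question>
<response>
{response}
</response>
Assign a score from 0 to 4 and justify it, e.g. "Score: 3"."""

def parse_score(judge_output):
    """Extract the 0-4 score from the judge's free-text verdict."""
    match = re.search(r"[Ss]core:?\s*([0-4])\b", judge_output)
    return int(match.group(1)) if match else None

prompt = JUDGE_PROMPT.format(
    question="Where is the prompt for code generation?",
    response="It lives in the AI Gateway prompt definitions.")
# verdict = call_judge(prompt)  # hypothetical LLM call
print(parse_score("Score: 3. Clear and mostly accurate."))  # 3
```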
**Compare quality to existing `/explain` evals**
* https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/main/data/config/duo_chat/eval_code_explanation_experiment.example.json?ref_type=heads
**Establish new code question & answer evaluations**
* Potential dataset: https://github.com/kinesiatricssxilm14/CodeRepoQA
### **Risks**
* We need to ensure the user understands the sources that informed Duo's response.
* To address this, we are ensuring the response highlights the primary sources used to inform the response.
* We may start hitting context window limits more frequently, and need to ensure users have some understanding of the impact when that happens.
* Multi-threaded conversations are one mitigator.
* We're also exploring ways for the user to summarize the current conversation to mitigate this risk. We are not making this a requirement within this feature set, but it may increase the need.
## Metrics
**Adoption**
* % of requests scoped to a repository or directory
**Activation & retention**
* MAU / billable users
**Throughput**
* The prior metrics can ladder up to improving merge request throughput.
**Sales objections**
* Remove, or significantly reduce, this product gap as a sales objection
* This is admittedly more qualitative than quantitative
## Post-MVC iterations
### Iteration 1
* Add support to Web IDE
### Iteration 2
* Add support for Self-hosted
* This requires the customer to maintain a vector DB to store embeddings
### Iteration 3
* Add [knowledge graph](https://gitlab.com/gitlab-org/gitlab/-/issues/521966) as additional tool alongside semantic search
* Our prototype testing indicated that [different question types benefit from different retrieval methods](https://gitlab.com/gitlab-org/gitlab/-/issues/517365#observations---2025-02-21). Semantic search produced higher quality responses for some question categories, while a knowledge graph performed better for others.
## Implementation Plan
Using the information above, here is a proposed implementation plan for discussion.
The "**Team Members**" column is meant as a starting point for who I think would be likely to work on each row. Since we are working in a subteam, we do not want to have "silos" of people and the intent is that we can all work together on these tasks.
### Phase 1: Foundation and Infrastructure
| Task | Team Members | Description |
|------|--------------|-------------|
| Define repository parsing strategy | Code Creation | Determine how to parse and chunk repository code for embedding, leveraging tree-sitter for supported languages |
| Design embedding schema | Global Search | Define the schema for storing code embeddings including metadata for file paths, repository context, and version information |
| Implement repository indexing trigger system | Code Creation + Global Search | Build system to trigger indexing when commits are pushed to main branch or feature branches |
| Design semantic search integration | Global Search | Design the integration pattern for querying the embedding abstraction layer with natural language |
| Create repository scoping UI mock-ups | Editor Extensions | Design UI components for selecting repository, folder, and file scopes within IDEs |
| Establish evaluation framework | Code Creation | Set up automated evaluation pipeline using proposed LLM judge and example questions |
### Phase 2: Core Functionality Development
| Task | Team Members | Description |
|------|--------------|-------------|
| Implement code parsing and chunking | Code Creation | Build the code parser that segments repositories into semantic chunks for embedding |
| Build embedding generation pipeline | Code Creation + Global Search | Create the pipeline that processes code chunks and generates embeddings via the abstraction layer |
| Develop semantic search query mechanism | Global Search | Implement the query mechanism that translates user questions into embedding searches |
| Implement repository scope selection UI | Editor Extensions | Build UI components for repository/folder/file selection in IDEs |
| Create repository metadata indexing | Global Search | Index repository structure metadata to support navigation questions |
| Build prompt engineering templates | Code Creation | Create prompts that effectively combine user questions with retrieved code context |
### Phase 3: Integration and User Experience
| Task | Team Members | Description |
|------|--------------|-------------|
| Integrate semantic search with chat | Global Search + Code Creation | Connect the embedding search results to the LLM chat interface |
| Implement citation and reference system | Code Creation | Create system to highlight and reference source files in chat responses |
| Add branch-awareness to indexing | Global Search | Enable indexing and querying against specific branches rather than just main |
| Implement IDE extension integration | Editor Extensions | Integrate the repository chat functionality into VS Code, JetBrains IDEs, etc. |
| Build context persistence system | Code Creation | Implement system to maintain conversation context including previously referenced files |
| Create command handlers for scope control | Editor Extensions | Implement `/reset` and other commands for managing conversation scope |
### Phase 4: Optimization and Launch Preparation
| Task | Team Members | Description |
|------|--------------|-------------|
| Optimize embedding retrieval for latency | Global Search | Tune retrieval systems to meet the 4-second p95 time to first token requirement |
| Implement feature flags | Global Search | Set up group-based feature flags for controlled rollout |
| Add telemetry for scoping and usage | Global Search + Editor Extensions | Implement logging for scope selection and semantic search usage |
| User acceptance testing across IDEs | Editor Extensions | Test functionality across all supported IDE environments |
| Performance testing across repository sizes | Global Search | Test with varying repository sizes to ensure acceptable performance |
| Final evaluation against benchmark questions | Code Creation | Run final evaluations against the defined question sets and compare to existing functionality |
### Post-MVC Planning
| Task | Team Members | Description |
|------|--------------|-------------|
| Web IDE support planning | Editor Extensions | Design implementation plan for Web IDE support |
### Sub-team members
* Anna Springfield (Editor Extensions)
* John Slaughter (Editor Extensions)
* Maddie van Niekerk (Global Search)
* Pam Artiaga (Code Creation)
* Tian Gao (Code Creation)
* Jordan Janes PM (Code Creation)
* Matt Nohr EM (Code Creation)