
Add CodebaseSearch Tool for Duo Chat

What does this MR do and why?

This MR is part of the Semantic search: chat with your codebase (&16910) initiative.

On a high-level view, the tasks for the entire epic are:

  1. Generate vector embeddings for files in eligible projects/repositories (see blueprint)
  2. On the IDE's Duo Chat panel, allow users to include repositories as additional context (see UI design)
  3. Add a Duo Chat Tool that will perform a semantic search over the vector embeddings of a project, and use the content of the search results to enhance the repository additional context.

This MR addresses the 3rd step.

Further details

In this MR, we add 2 new components:

  • the CodebaseSearch tool for Duo Chat
  • the Ai::ActiveContext::Queries::Code class for querying code embeddings

About Ai::ActiveContext::Queries::Code

  • is initialized with a search_term and a user
  • exposes a #filter method, which:
    • generates vector embeddings for the given search_term
    • if filter is called multiple times (e.g., when there are multiple projects), the embeddings for the search_term are generated only once
    • the generated embeddings of the search_term are then used to perform a semantic search over the existing vector embeddings of a given project's files
    • the search results are limited to the 10 closest matches. We may change this limit after further evaluation, but for now we use 10
  • this makes use of the ActiveContext gem (see ActiveContext blueprint | ActiveContext gem usage guide)
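
The behaviour listed above (embed the search term once, reuse it across projects, return the 10 closest matches) can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the actual Ai::ActiveContext::Queries::Code implementation: the class name SketchCodeQuery, the embedder callable, the in-memory documents_by_project hash, and the use of cosine similarity are all hypothetical. The real class delegates embedding generation and vector search to the ActiveContext gem.

```ruby
# Hypothetical sketch; NOT the real Ai::ActiveContext::Queries::Code.
class SketchCodeQuery
  DEFAULT_LIMIT = 10 # the MR limits results to the 10 closest matches

  # embedder: any callable mapping a String to an Array of Floats (assumed).
  # documents_by_project: { project_id => [{ content:, embedding: }, ...] }
  #   (an assumed in-memory stand-in for the vector store).
  def initialize(search_term:, user:, embedder:, documents_by_project:)
    @search_term = search_term
    @user = user # kept for parity with the description; unused in this sketch
    @embedder = embedder
    @documents_by_project = documents_by_project
  end

  # Returns up to `limit` documents of the given project, closest match first.
  def filter(project_id:, limit: DEFAULT_LIMIT)
    docs = @documents_by_project.fetch(project_id, [])
    docs.max_by(limit) { |doc| cosine_similarity(query_embedding, doc[:embedding]) }
  end

  private

  # Memoized, so repeated #filter calls embed the search term only once.
  def query_embedding
    @query_embedding ||= @embedder.call(@search_term)
  end

  def cosine_similarity(a, b)
    dot = a.zip(b).sum { |x, y| x * y }
    norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
    norm.zero? ? 0.0 : dot / norm
  end
end
```

Calling filter once per project on the same instance reuses the memoized embedding, which is why the query object is built once per search_term rather than once per project.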

About the CodebaseSearch tool

  1. This is not AI-dependent.
  2. This tool is called automatically whenever a repository additional context is present.
  3. On execute:
    • the tool calls the Ai::ActiveContext::Queries::Code#filter for each given repository additional context
    • the tool then adds the text content of the result to the content of the repository additional context.

Once the above steps are done, the Chat workflow moves forward and calls the ReactExecutor, now with the search results included in the repository additional contexts.
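
The execute step described above can be sketched as follows. This is a hypothetical illustration, not the tool's actual code: the method name enrich_repository_contexts, the search callable (standing in for Ai::ActiveContext::Queries::Code#filter), the :content key on each result, and joining snippets with blank lines are all assumptions. The context hash shape mirrors the simulated repository additional context shown later in this description.

```ruby
# Hypothetical sketch of the tool's execute step; names and data shapes are assumed.
def enrich_repository_contexts(additional_contexts, search:)
  additional_contexts.map do |context|
    # Only repository contexts trigger a search; others pass through unchanged.
    next context unless context[:category] == 'repository'

    # "gid://gitlab/Project/74" -> 74
    project_id = context[:id].split('/').last.to_i
    results = search.call(project_id)

    # Append the text of each result to whatever content was already there.
    snippets = results.map { |result| result[:content] }
    merged = [context[:content].to_s, *snippets].reject(&:empty?).join("\n\n")
    context.merge(content: merged)
  end
end
```

The sketch returns new hashes instead of mutating its input, so the original additional contexts stay available for logging or comparison.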

Feature Flag

This new tool will only be executed if the repository additional context is present. In turn, the repository additional context can only be added if the duo_include_context_repository FF is enabled.

For this reason, we are not introducing another Feature Flag specifically for the CodebaseSearch tool. We are essentially using the duo_include_context_repository as the toggle for this tool's execution.
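
The indirect gating can be illustrated with a small sketch. The method names and the repository_flag_enabled parameter are hypothetical, and this description does not state where in the codebase the duo_include_context_repository check actually happens. The sketch only shows why a second flag is redundant: with the flag off, no repository context survives, so the tool has nothing to run on.

```ruby
# Hypothetical sketch; names are illustrative and not GitLab's actual code.

# With the flag disabled, repository contexts are dropped before Chat sees them.
def accepted_additional_context(contexts, repository_flag_enabled:)
  return contexts if repository_flag_enabled

  contexts.reject { |context| context[:category] == 'repository' }
end

# The tool runs only when at least one repository context is present.
def codebase_search_needed?(contexts)
  contexts.any? { |context| context[:category] == 'repository' }
end
```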

Rationale for the Solution

There were initial discussions around using ReAct to let the LLM determine whether the tool needs to be called. However, after further discussions with @dmishunov, we decided that it is best to keep this tool simple as an initial iteration. A semantic search is always needed when the user provides a repository context, so there is no need for the LLM to decide whether to run it. We also don't want to over-engineer this initial solution, given that we are anticipating the move to Agentic Chat.

In later iterations, we may adopt one of the following more sophisticated solutions:

  • Incorporate the Codebase Search Tool into the ReAct Agent workflow
  • Use a different Agent specifically for codebase-related context (e.g.: we could introduce different context-enhancement tools for the repository)

Once we have migrated to the Agentic Chat workflow:

  • Make the tool available for Agentic Chat

For further details about this decision, please see:

References

Epics:

Blueprints:

Screenshots or screen recordings

These are example questions about the repository and Duo's answer given the added context of the semantic search results:


Screenshot_2025-05-26_at_18.11.50

Screenshot_2025-05-26_at_18.12.44

Screenshot_2025-05-26_at_18.13.27

Screenshot_2025-05-26_at_18.13.58

Logs and Traces

See comment thread: !192086 (comment 2531773555)

How to set up and validate locally

Setup

  1. Make sure that ActiveContext is enabled: set both the enabled and indexing_enabled configs in config/initializers/active_context to true.
  2. Make sure AI Gateway is set up and running
  3. Make sure Elasticsearch is set up and running

Generate embeddings for a repository

This tool requires vector embeddings to be generated for a repository. The embeddings pipeline is not yet fully complete. However, you can generate embeddings for a project on your disk by following these steps:

  1. Create an example project on your GDK

  2. Check out this POC project and run the index_project script:

    ./index_project \
    --project-id=<the_id_from_step_1> \
    --project-directory=<the directory of any project on disk> \
    --es-index="gitlab_active_context_code" \
    --allowed-file-exts="py,md" \
    --output-directory="/Users/pamartiaga/Code/experiments/code-embeddings/indexer_results"

    At this point, we are not yet generating vector embeddings. We are only storing the raw content of the project files to Elasticsearch.

    The actual embeddings pipeline will read files from Gitaly given a project path. This demo script reads files from the provided directory instead. The project_id needs to be an actual project on your GDK, since the search results are filtered according to what is accessible to the current user.

  3. On the Rails monolith, generate the embeddings for the file contents. You can do this by running the following on the Rails console:

    require 'json'
    
    results_file_path = "/Users/pamartiaga/Code/experiments/code-embeddings/indexer_results/20250526170348.json"
    results_text = File.read(results_file_path)
    
    results = JSON.parse(results_text)
    
    doc_ids = []
    doc_ids += results['created_documents'].pluck('id')
    doc_ids += results['updated_documents'].pluck('id')
    
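    # Track the given document IDs for embedding, then run the bulk processor once.
    def generate_embeddings(ids)
      ::Ai::Context::Collections::Code.track_refs!(routing: "1", hashes: ids)
      ::Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::Context::Queues::Code", 0)
    rescue StandardError => e # not Exception, so Ctrl-C can still interrupt the loop
      puts "Error: #{e.message}"
      puts e.backtrace
      # Clear the tracking queue after a failed batch.
      Ai::Context::Queues::Code.clear_tracking!
    end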
    
    doc_ids.each_slice(10) do |ids|
      generate_embeddings(ids)
      sleep(2)
    end

Simulate repository additional context input from the Language Server

The Language Server changes to include the repository additional context are not yet done. However, you can simulate a repository additional context by updating the aiAction mutation:

+        # simulate repository additional context
+        additional_context = method_arguments.delete(:additional_context) || []
+        additional_context << {
+          category: 'repository',
+          id: "gid://gitlab/Project/74", # this should be the ID of the project that you generated embeddings for
+          content: '',
+          metadata: {
+            repository_path: 'gitlab-org/modelops/applied-ml/code-suggestions/ai-assist' # project path
+          }
+        }
+        method_arguments[:additional_context] = additional_context
+

Test the Codebase Search Tool

In the Duo Chat panel of your IDE, ask questions about the project, e.g.:

Screenshot_2025-05-26_at_18.11.50

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #537897 (closed)

Edited by Pam Artiaga
