ActiveContext: Code embedding files (!190313)

What does this MR do and why?

Adds the AI Abstraction Layer files for the first collection: code embeddings.

Changes:

Initializer: enabled set to false
- Ai::ActiveContext::MigrationWorker will not create the partitions until enabled and indexing_enabled are both true
- Ai::ActiveContext::BulkProcessWorker will not execute any refs until enabled and indexing_enabled are both true
Migration to create code collection using the schema
Migration to set indexing_embedding_versions
Migration to set search_embedding_version
Collections class
- Implements its own redaction logic to return only results where the user has the read_code ability on the project (the same as for Elastic::FoundBlob)
Queue class
- Number of shards set to 1 to have control over rate limits
References class
- Preprocessor to fetch content from vector store
- Preprocessor to generate and set embeddings - one per reference
ContentFetcher preprocessor
- Uses adapter.search to run a passed query and sets the content for a ref
Specs
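
The gating behaviour described for the two workers can be sketched as follows. This is a minimal illustration, not the code in this MR: the method names active_context_ready? and process_refs are hypothetical, and the real workers read their flags from the initializer and settings rather than from keyword arguments.

```ruby
# Hypothetical gate: both flags must be true before any work happens.
def active_context_ready?(enabled:, indexing_enabled:)
  enabled && indexing_enabled
end

# Worker-like method (hypothetical) that returns early while the gate is closed.
def process_refs(refs, enabled:, indexing_enabled:)
  return [] unless active_context_ready?(enabled: enabled, indexing_enabled: indexing_enabled)

  # Stand-in for the real processing of each tracked reference.
  refs.map { |ref| "processed:#{ref}" }
end
```

With enabled left at false, as the initializer in this MR does, process_refs returns an empty array regardless of indexing_enabled.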

A reference is tracked as

Ai::Context::Collections::Code.track!({ routing: 1, id: "hash123" })

Or multiple with the same routing

Ai::Context::Collections::Code.track_refs!(routing: 1, hashes: ["hash123", "hash456"])

The idea is that when a git event happens, we know the project, and hence the routing. The indexer needs to know the routing as well, so this value is already known. We might need to change this to pass the partition instead, because at the moment both the indexer and Rails need to implement the same hashing to convert from routing to partition.
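
To illustrate why the hashing has to match on both sides, here is a minimal sketch of a routing-to-partition mapping. The function name partition_for and the choice of CRC32 are assumptions made for illustration; the actual hashing used by ActiveContext and the indexer is not shown in this description.

```ruby
require "zlib"

# Hypothetical mapping from a routing value to a partition number.
# CRC32 is an assumption, picked only because it is deterministic and easy
# to reproduce in another language; the real implementation may differ.
def partition_for(routing, number_of_partitions)
  Zlib.crc32(routing.to_s) % number_of_partitions
end

# Both callers must agree on the function and on number_of_partitions,
# otherwise a ref tracked by Rails points at a different partition than
# the one the indexer wrote to.
partition_for(1, 1) # always 0 when there is a single partition
```

Passing the partition directly, as suggested above, would remove the need to keep two implementations of this function in sync.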

References

How to set up and validate locally

Update the initializer config/initializers/active_context.rb by changing false to true
Create a connection
Run the migration worker: Ai::ActiveContext::MigrationWorker.new.perform
Verify that the partitions were created with the right schema

# on ES
GET gitlab_active_context_code

# on psql
\d gitlab_active_context_code

Verify that the collection record exists and has the right values

ActiveContext.adapter.connection.collections
=> [#<Ai::ActiveContext::Collection:0x0000000168a50c80
  id: 48,
  name: "gitlab_active_context_code",
  metadata: {"collection_class"=>"Ai::Context::Collections::Code", "include_ref_fields"=>false, "indexing_embedding_versions"=>[1]},
  number_of_partitions: 1,
  created_at: Tue, 06 May 2025 12:49:23.504247000 UTC +00:00,
  updated_at: Tue, 06 May 2025 12:50:37.082487000 UTC +00:00,
  connection_id: 10,
  include_ref_fields: false,
  indexing_embedding_versions: [1],
  search_embedding_version: nil,
  collection_class: "Ai::Context::Collections::Code">]

Add some docs to the vector store (this will be done by the indexer in reality, but we can bypass this for review). The only fields we need are:

_id (for ES): hash
id: hash
project_id
content
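
A document with these fields could be built as in the sketch below. The helper build_code_document is hypothetical, and using a SHA-256 digest of the content as the hash is an assumption; the description does not say how the indexer derives the hash.

```ruby
require "digest"

# Hypothetical helper that assembles the minimal fields listed above.
# Assumption: the hash is a SHA-256 hex digest of the content.
def build_code_document(project_id:, content:)
  hash = Digest::SHA256.hexdigest(content)

  {
    _id: hash,              # only needed for Elasticsearch
    id: hash,
    project_id: project_id,
    content: content
  }
end

doc = build_code_document(project_id: 2, content: "def hello; end")
```

The resulting hash can then be inserted into the vector store by whatever means is convenient locally, and doc[:id] is the value to pass to track_refs!.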

Track refs for the docs

Ai::Context::Collections::Code.track_refs!(routing: "routing used", hashes: ["hash of doc", "hash of another doc"])

Execute the queue: Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::Context::Queues::Code", 0)
Note that the embeddings field is set for the documents
Run some searches:
1. Find all documents for an admin user: Ai::Context::Collections::Code.search(query: ActiveContext::Query.all, user: User.select{|u| u.admin?}.first)
2. Find all documents for a non-admin user: Ai::Context::Collections::Code.search(query: ActiveContext::Query.all, user: User.all.reject { |u| u.admin? }.first)
3. Add a filter for project_id: ActiveContext::Query.filter(project_id: 2)
4. KNN search: ActiveContext::Query.knn(content: "some search term", limit: 3)
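
Searches 1 and 2 are presumably expected to differ because of the collection's redaction step. The sketch below shows the general idea only: the method redact_results and the readable_project_ids argument are hypothetical, and the real implementation checks the read_code ability per project rather than taking a precomputed list.

```ruby
# Hypothetical redaction step: keep only results from projects the user
# can read code in. readable_project_ids stands in for the ability check.
def redact_results(results, readable_project_ids)
  results.select { |doc| readable_project_ids.include?(doc[:project_id]) }
end

results = [
  { id: "hash123", project_id: 1 },
  { id: "hash456", project_id: 2 }
]

redact_results(results, [2]) # returns [{ id: "hash456", project_id: 2 }]
```

A user with access to every project gets the full result set back, while a user with no read_code ability on any project gets an empty array.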

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #536212 (closed)

Edited May 08, 2025 by Madelein van Niekerk
