ActiveContext: Code embedding files

What does this MR do and why?

Adds the AI Abstraction Layer files for the first collection: code embeddings.

Changes:

  • Initializer: enabled set to false
    • Ai::ActiveContext::MigrationWorker will not create the partitions until enabled and indexing_enabled are both true
    • Ai::ActiveContext::BulkProcessWorker will not execute any refs until enabled and indexing_enabled are both true
  • Migration to create code collection using the schema
  • Migration to set indexing_embedding_versions
  • Migration to set search_embedding_version
  • Collections class
    • Implements its own redaction logic to return only results where the user has the read_code ability on the project (the same as for Elastic::FoundBlob)
  • Queue class
    • Number of shards set to 1 to have control over rate limits
  • References class
    • Preprocessor to fetch content from vector store
    • Preprocessor to generate and set embeddings - one per reference
  • ContentFetcher preprocessor
    • Uses adapter.search to run a passed query and sets the content for a ref
  • Specs
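
The redaction described for the Collections class can be pictured roughly as follows. This is a minimal, self-contained sketch and not the code in this MR: `Hit`, `redact`, and the `policy` lambda are hypothetical stand-ins, and the lambda only imitates a read_code ability check on a project.

```ruby
# Hypothetical stand-in for a search hit; not a class from this MR.
Hit = Struct.new(:id, :project_id, keyword_init: true)

# Keep only hits whose project the user may read code in.
# `can_read_code` is any callable taking (user, project_id) and returning true or false.
def redact(hits, user, can_read_code)
  # Check each project once, even when many hits share it.
  allowed = hits.map(&:project_id).uniq.select { |pid| can_read_code.call(user, pid) }
  hits.select { |hit| allowed.include?(hit.project_id) }
end

hits = [
  Hit.new(id: "hash123", project_id: 1),
  Hit.new(id: "hash456", project_id: 2)
]

# Toy policy (assumption): :alice can read code only in project 1.
policy = ->(user, project_id) { user == :alice && project_id == 1 }

visible = redact(hits, :alice, policy)
```

Checking the ability once per project rather than once per hit keeps the number of permission checks small when many results come from the same project.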

A reference is tracked as

Ai::Context::Collections::Code.track!({ routing: 1, id: "hash123" })

Or multiple with the same routing

Ai::Context::Collections::Code.track_refs!(routing: 1, hashes: ["hash123", "hash456"])

The idea is that when a git event happens, we know the project and therefore the routing. The indexer needs the routing as well, so this value is already known on both sides. We might need to change this to pass the partition instead, because at the moment both the indexer and Rails need to implement the same hashing to convert from routing to partition.
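
To illustrate the coupling mentioned above, here is a minimal sketch of a routing-to-partition function. The helper name `partition_for` and the use of CRC32 are assumptions made for illustration only; this MR does not state which hash is actually used. The point is that the indexer and Rails would both have to compute exactly the same function to agree on a partition.

```ruby
require "zlib"

# Hypothetical helper: map a routing value to a partition index.
# CRC32 is an arbitrary choice for this sketch, not the hash used by ActiveContext.
def partition_for(routing, number_of_partitions)
  Zlib.crc32(routing.to_s) % number_of_partitions
end

# With a single partition every routing maps to partition 0.
single = partition_for(1, 1)

# With several partitions the result depends on the hash, so both sides must share it.
partition = partition_for(1, 4)
```

Passing the partition directly, as the paragraph suggests, would remove the need for both sides to keep such a function in sync.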

References

How to set up and validate locally

  1. Update the initializer config/initializers/active_context.rb by changing enabled from false to true
  2. Create a connection
  3. Run the migration worker: Ai::ActiveContext::MigrationWorker.new.perform
  4. Verify that the partitions were created with the right schema
# on ES
GET gitlab_active_context_code

# on psql
\d gitlab_active_context_code
  5. Verify that the collection record exists and has the right values
ActiveContext.adapter.connection.collections
=> [#<Ai::ActiveContext::Collection:0x0000000168a50c80
  id: 48,
  name: "gitlab_active_context_code",
  metadata: {"collection_class"=>"Ai::Context::Collections::Code", "include_ref_fields"=>false, "indexing_embedding_versions"=>[1]},
  number_of_partitions: 1,
  created_at: Tue, 06 May 2025 12:49:23.504247000 UTC +00:00,
  updated_at: Tue, 06 May 2025 12:50:37.082487000 UTC +00:00,
  connection_id: 10,
  include_ref_fields: false,
  indexing_embedding_versions: [1],
  search_embedding_version: nil,
  collection_class: "Ai::Context::Collections::Code">]
  6. Add some docs to the vector store (this will be done by the indexer in reality, but we can bypass this for review). The only fields we need are:
_id (for ES): hash
id: hash
project_id
content
  7. Track refs for the docs
Ai::Context::Collections::Code.track_refs!(routing: "routing used", hashes: ["hash of doc", "hash of another doc"])
  8. Execute the queue: Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::Context::Queues::Code", 0)
  9. Note that the embeddings field is set for the documents
  10. Run some searches:
    1. Find all documents for an admin user: Ai::Context::Collections::Code.search(query: ActiveContext::Query.all, user: User.select{|u| u.admin?}.first)
    2. Find all documents for a non-admin user: Ai::Context::Collections::Code.search(query: ActiveContext::Query.all, user: User.all.reject{|u| u.admin?}.first)
    3. Add a filter for project_id: ActiveContext::Query.filter(project_id: 2)
    4. KNN search: ActiveContext::Query.knn(content: "some search term", limit: 3)
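
For the step that adds docs to the vector store by hand, the payloads could be built along these lines. This is a sketch under stated assumptions: `build_doc` is a hypothetical helper, and using the SHA-256 of the content as the hash is a guess made for illustration, since this MR does not say how the indexer derives it. Only the four fields listed in the steps above are set.

```ruby
require "digest"

# Hypothetical helper: build a minimal document with the fields the steps list.
# Assumption: the hash is the SHA-256 hex digest of the content.
def build_doc(project_id:, content:)
  hash = Digest::SHA256.hexdigest(content)
  {
    _id: hash,             # only needed for Elasticsearch
    id: hash,
    project_id: project_id,
    content: content
  }
end

docs = [
  build_doc(project_id: 2, content: "def hello; end"),
  build_doc(project_id: 2, content: "puts 'world'")
]

# The values to pass as `hashes:` when tracking refs for these docs.
hashes = docs.map { |doc| doc[:id] }
```

The embeddings field is deliberately left out, because the steps above expect it to be filled in when the queue is executed.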

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #536212 (closed)

Edited by Madelein van Niekerk
