ActiveContext: Code embedding files
What does this MR do and why?
Adds the AI Abstraction Layer files for the first collection: code embeddings.
Changes:
- Initializer: enabled set to false
  - Ai::ActiveContext::MigrationWorker will not create the partitions until enabled and indexing_enabled are true
  - Ai::ActiveContext::BulkProcessWorker will not execute any refs until enabled and indexing_enabled are true
- Migration to create code collection using the schema
- Migration to set indexing_embedding_versions
- Migration to set search_embedding_version
- Collections class
  - Implements its own redaction logic to return results where the user has read_code ability on the project (the same as for Elastic::FoundBlob)
- Queue class
  - Number of shards set to 1 to have control over rate limits
- References class
  - Preprocessor to fetch content from vector store
  - Preprocessor to generate and set embeddings - one per reference
- ContentFetcher preprocessor
  - Uses adapter.search to run a passed query and sets the content for a ref
- Specs
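The gating described for the two workers can be pictured roughly as follows. This is a minimal sketch and not the code in this MR: Settings, GatedWorker and the :skipped / :processed return values are hypothetical, and only the flag names enabled and indexing_enabled come from the description above.

```ruby
# Hypothetical stand-in for the ActiveContext settings; only the two flag
# names are taken from the MR description.
Settings = Struct.new(:enabled, :indexing_enabled, keyword_init: true) do
  # Indexing work is allowed only when both flags are true.
  def indexing?
    enabled && indexing_enabled ? true : false
  end
end

# Hypothetical worker skeleton: it returns early unless both flags are true.
class GatedWorker
  def initialize(settings)
    @settings = settings
  end

  def perform
    return :skipped unless @settings.indexing?

    :processed
  end
end

GatedWorker.new(Settings.new(enabled: false, indexing_enabled: true)).perform # => :skipped
GatedWorker.new(Settings.new(enabled: true, indexing_enabled: true)).perform  # => :processed
```

With the initializer shipping as enabled: false, both workers would take the early-return path until the setting is flipped.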
A reference is tracked as
Ai::Context::Collections::Code.track!({ routing: 1, id: "hash123" })
Or multiple with the same routing
Ai::Context::Collections::Code.track_refs!(routing: 1, hashes: ["hash123", "hash456"])
The idea is that when a git event happens, we know the project and hence the routing. The indexer needs to know the routing as well, so this value is already known. We might need to change this to pass the partition instead, because at the moment both the indexer and Rails need to implement the same hashing to convert from routing to partition.
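To illustrate why the hashing has to match on both sides, here is a minimal sketch of a routing-to-partition conversion. The helper name partition_for and the choice of CRC32 are assumptions made for illustration; the description does not say which hash function ActiveContext or the indexer actually uses.

```ruby
require "zlib"

# Hypothetical helper: maps a routing value to a partition number.
# CRC32 is an assumption chosen for illustration only.
def partition_for(routing, number_of_partitions)
  Zlib.crc32(routing.to_s) % number_of_partitions
end

# With a single partition every routing value maps to partition 0.
partition_for(1, 1)

# The same routing must always yield the same partition, whichever side computes it.
partition_for(1, 8) == partition_for("1", 8)
```

If the indexer used a different hash or a different string form of the routing value, it would write documents to a partition that Rails never looks in, which is the motivation for passing the partition directly.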
References
- Draft: Code embedding files using ActiveContext (!189310 - closed)
- [Embedding indexing pipeline] Reference class (#536212 - closed)
How to set up and validate locally
- Update the initializer config/initializers/active_context.rb by changing false to true
- Create a connection
- Run the migration worker: Ai::ActiveContext::MigrationWorker.new.perform
- Verify that the partitions were created with the right schema
# on ES
GET gitlab_active_context_code
# on psql
\d gitlab_active_context_code
- Verify that the collection record exists and has the right values
ActiveContext.adapter.connection.collections
=> [#<Ai::ActiveContext::Collection:0x0000000168a50c80
id: 48,
name: "gitlab_active_context_code",
metadata: {"collection_class"=>"Ai::Context::Collections::Code", "include_ref_fields"=>false, "indexing_embedding_versions"=>[1]},
number_of_partitions: 1,
created_at: Tue, 06 May 2025 12:49:23.504247000 UTC +00:00,
updated_at: Tue, 06 May 2025 12:50:37.082487000 UTC +00:00,
connection_id: 10,
include_ref_fields: false,
indexing_embedding_versions: [1],
search_embedding_version: nil,
collection_class: "Ai::Context::Collections::Code">]
- Add some docs to the vector store (this will be done by the indexer in reality, but we can bypass this for review). The only fields we need are:
_id (for ES): hash
id: hash
project_id
content
- Track refs for the docs
Ai::Context::Collections::Code.track_refs!(routing: "routing used", hashes: ["hash of doc", "hash of another doc"])
- Execute the queue: Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::Context::Queues::Code", 0)
- Note that the embeddings field is set for the documents
- Run some searches:
  - Find all documents for an admin user: Ai::Context::Collections::Code.search(query: ActiveContext::Query.all, user: User.select{|u| u.admin?}.first)
  - Find all documents for a non-admin user: Ai::Context::Collections::Code.search(query: ActiveContext::Query.all, user: User.reject{|u| u.admin?}.first)
  - Add a filter for project_id: ActiveContext::Query.filter(project_id: 2)
  - KNN search: ActiveContext::Query.knn(content: "some search term", limit: 3)
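The admin and non-admin searches are expected to return different results because of the collection's redaction step. The sketch below shows the general shape of that filtering under stated assumptions: Hit, redact and readable_project_ids are hypothetical names, and the list of readable project IDs stands in for a real read_code permission check on each project.

```ruby
require "set"

# Hypothetical search hit: only project_id matters for redaction here.
Hit = Struct.new(:id, :project_id, keyword_init: true)

# Keeps only hits whose project the user may read code in.
# `readable_project_ids` stands in for a real read_code permission check.
def redact(hits, readable_project_ids)
  allowed = readable_project_ids.to_set
  hits.select { |hit| allowed.include?(hit.project_id) }
end

hits = [
  Hit.new(id: "hash123", project_id: 1),
  Hit.new(id: "hash456", project_id: 2)
]

redact(hits, [1]).map(&:id)    # => ["hash123"]
redact(hits, [1, 2]).map(&:id) # => ["hash123", "hash456"]
```

When validating locally, a non-admin user should only see documents whose project_id belongs to a project on which that user has the read_code ability.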
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #536212 (closed)
Edited by Madelein van Niekerk