ActiveContext: embeddings_with_model_redesign preprocessor
What does this MR do and why?
This adds the embeddings_with_model_redesign preprocessor and includes it in the base Reference class.
This preprocessor is not yet called anywhere in the Code Embeddings pipeline. This is similar to the preload processor, which is included in the base Reference class but not used in References::Code.
The integration of this new preprocessor in the References::Code class is done in ActiveContext: integrate new model design into ... (!222417 - merged).
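To illustrate "included in the base class but not yet used", here is a minimal, self-contained sketch. Every name in it (SketchEmbeddingsPreprocessor, SketchReference, add_preprocessor, apply_sketch_embeddings, and so on) is hypothetical and chosen only for illustration. It is not the actual ActiveContext API or the code in this MR.

```ruby
# Hypothetical stand-in for a preprocessor module; not the real ActiveContext code.
module SketchEmbeddingsPreprocessor
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Available on every class that includes the module,
    # but it only runs when a class explicitly calls it.
    def apply_sketch_embeddings(refs)
      refs.map { |ref| ref.merge(embedded: true) }
    end
  end
end

# Hypothetical base reference class that mixes in the preprocessor.
class SketchReference
  include SketchEmbeddingsPreprocessor

  # Each class keeps its own list of registered preprocessors.
  def self.preprocessors
    @preprocessors ||= []
  end

  def self.add_preprocessor(name, &block)
    preprocessors << { name: name, block: block }
  end

  # Runs the registered preprocessors in order; with none registered,
  # the refs pass through unchanged.
  def self.preprocess(refs)
    preprocessors.reduce(refs) { |acc, pre| pre[:block].call(acc) }
  end
end

# Inherits the preprocessor method but never registers it.
class SketchCodeReference < SketchReference
end

# What a later integration step could look like: the subclass opts in.
class SketchIntegratedReference < SketchReference
  add_preprocessor(:embeddings) { |refs| apply_sketch_embeddings(refs) }
end
```

In this sketch, SketchCodeReference.preprocess returns its input untouched even though apply_sketch_embeddings is available on the class. Only the subclass that registers the preprocessor changes behavior.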
Step-by-step changes summary
| MR | Status |
|---|---|
| Introduce the new hash/object-based models | |
| Add indexing_embedding_fields to Collection class | |
| Add embeddings_with_model_redesign preprocessor | This MR |
| Add migration to update models metadata | Ready for review |
| Integrate model redesign into Code Embeddings pipeline | Ready for review |
References
Screenshots or screen recordings
N/A
How to set up and validate locally
Since this is not yet integrated into the Code Embeddings pipeline, the unit tests should cover the validation.
However, you can also check that this change does not affect the current pipeline:
- Set up your Code Embeddings Indexing pipeline.
- In ee/app/services/ai/active_context/code/indexing_service_base.rb, comment out the code inside enqueue_refs!.

  Note: this is for easier verification, allowing you to manually validate the refs processed by the ::Ai::ActiveContext::BulkProcessWorker without the confusion of the same worker automatically picking up queued refs in the background.

      diff --git a/ee/app/services/ai/active_context/code/indexing_service_base.rb b/ee/app/services/ai/active_context/code/indexing_service_base.rb
      index 125148469843..afb30d49ac4e 100644
      --- a/ee/app/services/ai/active_context/code/indexing_service_base.rb
      +++ b/ee/app/services/ai/active_context/code/indexing_service_base.rb
      @@ -34,7 +34,7 @@ def run_indexer!(&block)
               end

               def enqueue_refs!(ids)
      -          ::Ai::ActiveContext::Collections::Code.track_refs!(hashes: ids, routing: repository.project_id)
      +          # ::Ai::ActiveContext::Collections::Code.track_refs!(hashes: ids, routing: repository.project_id)
               end

- Index new code by doing either of the following:
  - run initial indexing for a project you have not indexed before
  - push new commits to a project that has already gone through initial indexing
- On the gitlab_active_context_code index, pick one of the chunks created during indexing, and verify that the embeddings_v1 field of this chunk is still empty.
- Manually add the chunk's id/ref to the bulk processing queue:

      ::Ai::ActiveContext::Collections::Code.track_refs!(routing: "1", hashes: ["4b48fbce868f829cd39d1757dc3937af5d7a56d7dc9973f45d096050b54330dd"])

- Wait for the ::Ai::ActiveContext::BulkProcessWorker to process the queued ref, or run it manually:

      ::Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::ActiveContext::Queues::Code", 0)

- Check the document on the vector store index and verify that its embeddings_v1 field has been filled.
- For further verification, check log/active_context.log and verify that there are no errors related to the embeddings version and processing.
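For the final log check, a small helper can save some scrolling. The sketch below is only a convenience and rests on assumptions: it treats each log line as either a JSON object with "severity" and "message" keys or as plain text, which may not match the real format of log/active_context.log. The method name embedding_error_lines is hypothetical.

```ruby
require 'json'

# Returns the log lines that look like embedding-related errors.
# Assumption: one entry per line, either JSON with "severity" and
# "message" keys, or plain text. Adjust to the real log format.
def embedding_error_lines(path)
  File.foreach(path).select do |line|
    entry = begin
      JSON.parse(line)
    rescue JSON::ParserError
      nil
    end

    severity, text =
      if entry.is_a?(Hash)
        [entry['severity'].to_s, entry['message'].to_s]
      else
        # Not a JSON object: match both patterns against the raw line.
        [line, line]
      end

    severity.match?(/error|fatal/i) && text.match?(/embedding/i)
  end
end
```

Calling embedding_error_lines('log/active_context.log') from the GitLab root directory and getting back an empty array would be consistent with the expectation above. It does not replace reading the log if the format differs from these assumptions.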
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #588847