ActiveContext: embeddings_with_model_redesign preprocessor

What does this MR do and why?

This adds the embeddings_with_model_redesign preprocessor and includes it in the base Reference class.

This preprocessor is not yet called anywhere in the Code Embeddings pipeline. As with the preload processor, it is included in the base Reference class but not used in References::Code.

The integration of this new preprocessor in the References::Code class is done in ActiveContext: integrate new model design into ... (!222417 - merged).

Step-by-step changes summary

MR | Status
Introduce the new hash/object-based models |
Add indexing_embedding_fields to Collection class |
Add embeddings_with_model_redesign preprocessor | This MR
Add migration to update models metadata | Ready for review
Integrate model redesign into Code Embeddings pipeline | Ready for review

References

Screenshots or screen recordings

N/A

How to set up and validate locally

Since this is not yet integrated into the Code Embeddings pipeline, the unit tests should cover the validation.

However, you can also check that this change does not affect the current pipeline:

  1. Set up your Code Embeddings Indexing pipeline

  2. In ee/app/services/ai/active_context/code/indexing_service_base.rb, comment out the code inside enqueue_refs!

    Note: this is for easier verification, allowing you to manually validate the refs processed by the ::Ai::ActiveContext::BulkProcessWorker without the confusion of the same worker automatically picking up queued refs in the background.

    diff --git a/ee/app/services/ai/active_context/code/indexing_service_base.rb b/ee/app/services/ai/active_context/code/indexing_service_base.rb
    index 125148469843..afb30d49ac4e 100644
    --- a/ee/app/services/ai/active_context/code/indexing_service_base.rb
    +++ b/ee/app/services/ai/active_context/code/indexing_service_base.rb
    @@ -34,7 +34,7 @@ def run_indexer!(&block)
            end
    
            def enqueue_refs!(ids)
    -          ::Ai::ActiveContext::Collections::Code.track_refs!(hashes: ids, routing: repository.project_id)
    +          # ::Ai::ActiveContext::Collections::Code.track_refs!(hashes: ids, routing: repository.project_id)
            end
  3. Index new code by doing either of the following:

    • run initial indexing for a project you have not indexed before
    • push new commits to a project that has already gone through initial indexing
  4. On the gitlab_active_context_code index, pick one of the chunks created during indexing, and verify that the embeddings_v1 field of this chunk is still empty.

  5. Manually add the chunk's id/ref to the bulk processing queue:

    ::Ai::ActiveContext::Collections::Code.track_refs!(routing: "1", hashes: ["4b48fbce868f829cd39d1757dc3937af5d7a56d7dc9973f45d096050b54330dd"])
  6. Wait for the ::Ai::ActiveContext::BulkProcessWorker to process the queued ref, or run it manually:

    ::Ai::ActiveContext::BulkProcessWorker.new.perform("Ai::ActiveContext::Queues::Code", 0)
  7. Check the document on the vector store index and verify that its embeddings_v1 field has been filled.

  8. For further verification, check log/active_context.log and verify that there are no errors related to the embeddings version or to embeddings processing.
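
The log check in the last step can be scripted. The following is a minimal sketch, not part of this MR: the helper name embedding_error_lines is hypothetical, and it assumes each log line is either a JSON object with a "severity" key or plain text, which may not match the actual format of log/active_context.log.

```ruby
require "json"

# Hypothetical helper: returns the log lines that look like
# embeddings-related errors.
# Assumption: a line is either a JSON object with a "severity" key
# or free-form plain text.
def embedding_error_lines(lines)
  lines.select do |line|
    entry = begin
      JSON.parse(line)
    rescue JSON::ParserError
      nil # not JSON; fall back to plain-text matching
    end

    if entry.is_a?(Hash)
      severity = entry["severity"].to_s
      text = entry.values.join(" ")
    else
      severity = line
      text = line
    end

    severity.match?(/error|fatal/i) && text.match?(/embedding/i)
  end
end

# Scan the log only if it exists at the path named in the steps above.
log_path = "log/active_context.log"
if File.exist?(log_path)
  puts embedding_error_lines(File.readlines(log_path, chomp: true))
end
```

If the script prints nothing after the ref has been processed, no line matched both patterns. The regular expressions are deliberately broad and are only a starting point.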

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #588847

Edited by Pam Artiaga
