ActiveContext: Skip current model during model backfill
What does this MR do and why?
This MR enables skipping the current embedding model during backfill operations in ActiveContext. When migrating to a new embedding model, we need to backfill embeddings using only the next model, not the current one, to avoid redundant processing.
How it works:
- Dedicated CodeBackfill Queue - A new CodeBackfill queue inherits from Code and sets preprocess_options to { next_model_only: true }, so backfill operations process only the next embedding model. A separate queue also removes a race condition in detecting when the backfill is done: on the shared queue, incremental indexing keeps adding refs, which made it ambiguous when the backfill had finished. With a dedicated queue, an empty queue means the backfill is done.
- Options Pipeline - To support passing next_model_only through the processing chain:
  - Preprocessor concern now accepts and forwards **options to preprocessor blocks
  - Queue concern provides a preprocess_options method (defaults to an empty hash)
  - BulkProcessQueue passes queue.preprocess_options to preprocessing
  - Reference.preprocess_references accepts and forwards options
- Model Filtering - Reference#indexing_embedding_models now accepts a next_model_only parameter to return only the next model when needed.
- Collection-Agnostic Design - Collection concern has a backfill_queue method that defaults to the main queue but can be overridden, making this extensible to other collections.
- BackfillEmbeddings Task - Uses collection_class.backfill_queue to get the appropriate queue
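The options flow described above can be illustrated with a minimal, self-contained sketch. Everything in the OptionsSketch module is a hypothetical stand-in written for illustration (including the model symbols and the process helper). It only borrows the names preprocess_options, preprocess_references, indexing_embedding_models and next_model_only from this MR, and the real ActiveContext classes may be structured differently.

```ruby
# Hypothetical, simplified stand-ins for illustration only;
# these are not the real ActiveContext classes.
module OptionsSketch
  CURRENT_MODEL = :current_model # assumed placeholder names
  NEXT_MODEL = :next_model

  class Reference
    attr_reader :id

    def initialize(id)
      @id = id
    end

    # Returns the models to generate embeddings for.
    # With next_model_only: true, the current model is skipped.
    def indexing_embedding_models(next_model_only: false)
      next_model_only ? [NEXT_MODEL] : [CURRENT_MODEL, NEXT_MODEL]
    end

    # Accepts options and forwards them to each reference.
    def self.preprocess_references(refs, **options)
      refs.flat_map do |ref|
        ref.indexing_embedding_models(**options).map { |model| [ref.id, model] }
      end
    end
  end

  # Regular queue: no special preprocessing options.
  class Code
    def self.preprocess_options
      {}
    end
  end

  # Backfill queue: same behaviour, but restricted to the next model.
  class CodeBackfill < Code
    def self.preprocess_options
      { next_model_only: true }
    end
  end

  # Stand-in for the bulk processing step: it passes the queue's
  # options to preprocessing without knowing what they mean.
  def self.process(queue, refs)
    Reference.preprocess_references(refs, **queue.preprocess_options)
  end
end

refs = [OptionsSketch::Reference.new(1), OptionsSketch::Reference.new(2)]
regular = OptionsSketch.process(OptionsSketch::Code, refs)
backfill = OptionsSketch.process(OptionsSketch::CodeBackfill, refs)
puts "regular queue work items:  #{regular.length}"
puts "backfill queue work items: #{backfill.length}"
```

In this sketch the processing step stays generic: only the queue decides which options apply, which is why adding a dedicated queue is enough to change the set of models processed.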
Why?
During embedding model migrations, we need to backfill the new model without reprocessing the current model. This MR provides a clean, extensible way to handle this by using a dedicated queue with specific preprocessing options.
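As a rough illustration of why a dedicated queue makes completion easy to detect, the sketch below models two in-memory queues and a collection that overrides backfill_queue. The BackfillSketch module, the in-memory storage, the DocsCollection class and the backfill_done? helper are all assumptions made for this example; only the names backfill_queue, queued_items, Code and CodeBackfill come from this MR.

```ruby
# Hypothetical in-memory model for illustration only;
# the real queues and collections are implemented differently.
module BackfillSketch
  class Queue
    # Class-level instance variable, so each queue class has its own items.
    def self.items
      @items ||= []
    end

    def self.push(ref)
      items << ref
    end

    def self.queued_items
      items.dup
    end

    def self.drain!
      items.clear
    end
  end

  class Code < Queue; end
  class CodeBackfill < Code; end
  class Docs < Queue; end # assumed second queue, not from the MR

  # Default behaviour: backfill goes through the main queue.
  module Collection
    def backfill_queue
      queue
    end
  end

  # Overrides the default to use the dedicated backfill queue.
  class CodeCollection
    extend Collection

    def self.queue
      Code
    end

    def self.backfill_queue
      CodeBackfill
    end
  end

  # Assumed example of a collection that keeps the default.
  class DocsCollection
    extend Collection

    def self.queue
      Docs
    end
  end

  # Assumed helper: the backfill is done when its queue is empty.
  def self.backfill_done?(collection_class)
    collection_class.backfill_queue.queued_items.empty?
  end
end

BackfillSketch::CodeBackfill.push(:ref_a)   # backfill work
BackfillSketch::Code.push(:ref_b)           # incremental indexing keeps adding refs
BackfillSketch::CodeBackfill.drain!         # backfill queue is processed
puts BackfillSketch.backfill_done?(BackfillSketch::CodeCollection)
```

Here the incremental ref left on the main queue does not affect the completion check, because the check only looks at the dedicated backfill queue.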
References
Related to #589327 (closed)
How to test
- Run activation service to switch to a new field (but use the same model)
Ai::ActiveContext::EmbeddingModelActivationService.new(collection_class: Ai::ActiveContext::Collections::Code, model_ref: "text_embedding_005_vertex", dimensions: 768).execute!
- Run task worker once
Ai::ActiveContext::TaskWorker.new.perform
- Verify that the embeddings_v2 field was added to the mapping
- Run task worker once
Ai::ActiveContext::TaskWorker.new.perform
- Verify items were added to the CodeBackfill queue
::Ai::ActiveContext::Queues::CodeBackfill.queued_items # has refs
::Ai::ActiveContext::Queues::Code.queued_items # no refs
- Execute the queues
ActiveContext.execute_all_queues!
- Verify that embeddings are generated only once (on master they would be generated twice: for the current model and for the next model)
- Verify that embeddings_v2 is populated
- Run task worker until the backfill task is marked as completed
Ai::ActiveContext::TaskWorker.new.perform
- Run task worker
Ai::ActiveContext::TaskWorker.new.perform
- Run task worker
Ai::ActiveContext::TaskWorker.new.perform
- Now embeddings_v1 should be nullified