ActiveContext: Consistent error handling for preprocessors
What does this MR do and why?
Adds better error handling for ActiveContext preprocessors.
Instead of always returning the refs, preprocessors now return { successful:, failed: }. Only the successful refs continue on to indexing; the failed refs are re-enqueued for retry.
It also passes only successful refs from one preprocessor to the next. E.g. if the first preprocessor failed for a ref, we shouldn't run the second preprocessor on that ref.
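A minimal sketch of the chaining described above, assuming each preprocessor is a callable that takes an array of refs and returns a { successful:, failed: } hash. The `run_preprocessors` name and the lambda-based preprocessors are hypothetical and only illustrate the flow; they are not the code in this MR.

```ruby
# Hypothetical runner: feeds only the successful refs of one preprocessor
# into the next and accumulates every failure for a later retry.
def run_preprocessors(refs, preprocessors)
  failed = []
  remaining = refs

  preprocessors.each do |preprocessor|
    break if remaining.empty?

    result = preprocessor.call(remaining)
    failed.concat(result[:failed])
    remaining = result[:successful]
  end

  # `successful` would go on to indexing, `failed` would be re-enqueued.
  { successful: remaining, failed: failed }
end
```

With this shape, a ref that fails in the first preprocessor is never seen by the second one, and it still ends up in the final failed list.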
We introduce two new methods that can be called with refs and a block:
- with_per_ref_handling
- with_batch_handling
with_per_ref_handling: executes the block on every ref; if a ref fails, it is added to failed and processing continues with the rest. Useful for per-ref operations.
with_batch_handling: executes the block on all the refs at once. If it fails, it sets all refs as failed.
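The two helpers could look roughly like the sketch below. Only the method names and the { successful:, failed: } return shape come from this description; the module name, the choice to rescue StandardError, and the absence of logging are assumptions made for illustration.

```ruby
# Hypothetical module; the real helpers may live elsewhere and also log errors.
module PreprocessorErrorHandling
  extend self

  # Runs the block once per ref. A ref whose block raises is recorded as
  # failed, and processing continues with the remaining refs.
  def with_per_ref_handling(refs)
    result = { successful: [], failed: [] }

    refs.each do |ref|
      begin
        yield(ref)
        result[:successful] << ref
      rescue StandardError
        result[:failed] << ref
      end
    end

    result
  end

  # Runs the block once for the whole batch. If it raises, every ref fails.
  def with_batch_handling(refs)
    yield(refs)
    { successful: refs, failed: [] }
  rescue StandardError
    { successful: [], failed: refs }
  end
end
```

The per-ref variant trades one rescue per ref for finer-grained retries, while the batch variant fits operations that either succeed or fail as a whole, such as a single bulk request.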
Preprocessors
- chunking preprocessor:
  - fails all refs if the chunks method is not defined
  - fails individual ref if there's an error with the chunk process
- embedding preprocessor:
- fails all refs if there's an error bulk generating embeddings
- also refactors the preprocessor to build up bulk embeddings even when a ref has only one document (previously we did bulk generation only if a ref contained multiple docs; now we collect all documents from all refs and process them in batches)
- preload preprocessor:
  - fails all refs if the preload_indexing_data method is not defined
  - fails individual ref if a corresponding database record can't be found
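The embedding refactor described in the list above (collect documents across all refs, embed them in batches, fail every ref on a bulk error) can be sketched as follows. The `Ref` struct, `bulk_embed`, the `batch_size:` and `embed_batch:` parameters, and the extra `embeddings:` key are all hypothetical; `embed_batch` stands in for whatever client actually generates embeddings and is assumed to return one vector per input string.

```ruby
# Hypothetical ref shape: an id plus the text of each document to embed.
Ref = Struct.new(:id, :documents)

def bulk_embed(refs, batch_size:, embed_batch:)
  # Flatten to [ref, doc_index, content] so each vector can be routed back.
  entries = refs.flat_map do |ref|
    ref.documents.each_with_index.map { |content, index| [ref, index, content] }
  end

  embeddings = Hash.new { |hash, ref_id| hash[ref_id] = [] }

  # Batches are cut across ref boundaries, so single-document refs are
  # embedded together instead of one request per ref.
  entries.each_slice(batch_size) do |batch|
    vectors = embed_batch.call(batch.map { |_ref, _index, content| content })
    batch.zip(vectors).each do |(ref, index, _content), vector|
      embeddings[ref.id][index] = vector
    end
  end

  { successful: refs, failed: [], embeddings: embeddings }
rescue StandardError
  # Any error during bulk generation marks every ref as failed.
  { successful: [], failed: refs, embeddings: {} }
end
```

Because batches span refs, a single failed request cannot be attributed to one ref, which is consistent with failing the whole set and retrying it.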
References
- Draft: Code embedding files using ActiveContext (!189310 - closed)
- [Embedding indexing pipeline] Reference class (#536212 - closed)
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #536212 (closed)