ActiveContext: add chunk preprocessor
What does this MR do and why?
Adds support for ActiveContext references to have multiple documents per reference. A document corresponds to a document in ES or a record in Postgres.
One way documents can be created is by using a chunker. A chunker splits content into chunks and then stores each chunk on a document.
The chunking preprocessor is called with field (the field to populate) and chunk_on (the method that returns the content to chunk; in the example below, it combines record.title and record.description).
This MR also adds a simple example chunker that chunks by size and overlap. The idea is that feature teams should be able to create chunkers based on their needs, for example one built on tree-sitter.
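As a rough illustration of chunking by size and overlap, here is a minimal sliding-window sketch. It is not the code added in this MR: the class name BySizeSketch, the content accessor, and the decision to stop as soon as a window reaches the end of the text are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a size/overlap chunker; not the MR's implementation.
class BySizeSketch
  # Assumption: the content to split is handed over through this accessor.
  attr_accessor :content

  def initialize(chunk_size:, overlap:)
    raise ArgumentError, "chunk_size must be positive" unless chunk_size.positive?
    raise ArgumentError, "overlap must be in 0...chunk_size" unless (0...chunk_size).cover?(overlap)

    @chunk_size = chunk_size
    @overlap = overlap
  end

  # Returns an array of strings; consecutive chunks share `overlap` characters.
  def chunks
    text = content.to_s
    return [] if text.empty?

    step = @chunk_size - @overlap # how far the window advances each time
    result = []
    start = 0
    loop do
      result << text[start, @chunk_size]
      # Stop once the current window reaches the end of the text.
      break if start + @chunk_size >= text.length

      start += step
    end
    result
  end
end

chunker = BySizeSketch.new(chunk_size: 10, overlap: 3)
chunker.content = "abcdefghijklmnopqrstuvwxyz"
chunker.chunks
# four chunks: "abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"
```

Each chunk starts chunk_size - overlap characters after the previous one, so the last overlap characters of one chunk are repeated at the start of the next.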
A chunker should implement a chunks method that assumes ref.content is available and then returns an array of chunks. The chunker preprocessor will then assign these chunks to the specified field as documents.
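The contract above can be sketched roughly as follows. This is an illustration under assumptions, not the preprocessor from this MR: SketchRef, ByParagraphSketch, and chunk_sketch are hypothetical names, and passing the content to the chunker through a content accessor is a guess at how "ref.content is available" might be wired up.

```ruby
# Hypothetical reference with a method that returns the text to chunk.
SketchRef = Struct.new(:title, :description, :documents) do
  def title_and_description
    "Title: #{title}\n\nDescription: #{description}"
  end
end

# A custom chunker: one chunk per blank-line-separated paragraph.
# It only needs to respond to `chunks` and return an array.
class ByParagraphSketch
  attr_accessor :content # assumption: content is set before `chunks` is called

  def chunks
    content.to_s.split(/\n{2,}/).map(&:strip).reject(&:empty?)
  end
end

# Sketch of the preprocessor step: chunk each ref and store one document
# per chunk, keyed by the requested field.
def chunk_sketch(refs:, chunker:, chunk_on:, field:)
  refs.each do |ref|
    chunker.content = ref.public_send(chunk_on)
    ref.documents = chunker.chunks.map { |text| { field => text } }
  end
  refs
end

ref = SketchRef.new("Fix login", "Handle expired tokens.")
chunk_sketch(
  refs: [ref],
  chunker: ByParagraphSketch.new,
  chunk_on: :title_and_description,
  field: :content_field
)
ref.documents
# two documents, each with a single :content_field key:
# "Title: Fix login" and "Description: Handle expired tokens."
```

Because the preprocessor only relies on the chunks method, swapping in a different chunking strategy should not require changes to the preprocessor itself.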
Other changes:
- Updating the jsons method to use documents if they are set, otherwise fall back to as_indexed_json/as_indexed_jsons
- Allow setting shared attributes: key-value pairs that should be present on every document.
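The two changes above could interact roughly as in the sketch below. It is only an illustration: JsonsSketchRef is a hypothetical class, the placeholder as_indexed_json body is invented, and merging shared attributes into the fallback JSON as well as into documents is an assumption rather than confirmed behaviour.

```ruby
# Hypothetical reference class; only the method names `jsons`, `documents`,
# `as_indexed_json`, and `shared_attributes` come from the description.
class JsonsSketchRef
  attr_accessor :documents

  def initialize(iid:)
    @iid = iid
    @documents = []
  end

  # Key-value pairs that should appear on every document.
  def shared_attributes
    { iid: @iid }
  end

  # Single-document representation used when no documents were set.
  def as_indexed_json
    { title: "placeholder" }
  end

  # Use documents when present, otherwise fall back to as_indexed_json.
  # Assumption: shared attributes are merged in both cases.
  def jsons
    docs = documents.empty? ? [as_indexed_json] : documents
    docs.map { |doc| shared_attributes.merge(doc) }
  end
end

ref = JsonsSketchRef.new(iid: 5)
ref.jsons # one hash: the fallback JSON plus iid

ref.documents = [{ content_field: "first chunk" }, { content_field: "second chunk" }]
ref.jsons # two hashes: one per document, each also carrying iid
```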
The chunk preprocessor can be called as:
class MergeRequest < ::ActiveContext::Reference
  CONTENT_METHOD = :title_and_description
  CHUNK_FIELD = :content_field
  CHUNK_SIZE = 100
  OVERLAP = 10

  add_preprocessor :preload do |refs|
    preload(refs)
  end

  add_preprocessor :chunking do |refs|
    chunker = Chunkers::BySize.new(chunk_size: CHUNK_SIZE, overlap: OVERLAP)
    chunk(refs: refs, chunker: chunker, chunk_on: CONTENT_METHOD, field: CHUNK_FIELD)
  end

  def title_and_description
    "Title: #{database_record.title}\n\nDescription: #{database_record.description}"
  end

  def shared_attributes
    {
      iid: database_record.iid,
      namespace_id: database_record.project.id,
      traversal_ids: database_record.project.elastic_namespace_ancestry
    }
  end
  ...
This generates one document per chunk, with the chunk content stored in the field named by CHUNK_FIELD alongside the shared attributes:
ActiveContext::Reference.preprocess_references([ref]).first.jsons
=> [{:iid=>5, :namespace_id=>2, :traversal_ids=>"24-p2-", :content_field=>"Title: Optimize database queries in user controller\n\nDescription: Reduced N+1 queries in the user co", :ref_id=>"10", :ref_version=>1743163685},
{:iid=>5, :namespace_id=>2, :traversal_ids=>"24-p2-", :content_field=>"he user controller by adding eager loading for user associations. Improves performance by approximat", :ref_id=>"10", :ref_version=>1743163685},
{:iid=>5, :namespace_id=>2, :traversal_ids=>"24-p2-", :content_field=>"approximately 35% on user profile pages.", :ref_id=>"10", :ref_version=>1743163685}]
Each of these jsons will be indexed.
Generating embeddings per document is handled in a follow-up MR; at the moment, embeddings are generated at the reference level.
References
How to set up and validate locally
- Create the necessary files as per #524341 (closed)
- Add the chunk preprocessor as shown
- Run preprocessing for a ref:
ActiveContext::Reference.preprocess_references([ref]).first.jsons
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #523414 (closed)