ActiveContext: add chunk preprocessor

What does this MR do and why?

Adds support for ActiveContext references to have multiple documents per reference. A document corresponds to a document in ES or a record in Postgres.

One way documents can be created is by using a chunker. A chunker splits content into chunks and then stores each chunk on a document.

The chunking preprocessor is called with field (the field to populate) and chunk_on (a method that returns the content to chunk; in the example below it combines record.title and record.description).

This MR also adds a simple example chunker that chunks by size and overlap. The idea is that feature teams can create chunkers based on their own needs, for example using tree-sitter.
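To illustrate chunking by size and overlap, here is a minimal sketch. It is not the Chunkers::BySize implementation from this MR: the class name BySizeSketch and the content accessor are assumptions made for illustration. Each chunk starts chunk_size - overlap characters after the previous one, which matches the 100/10 example output further down, where consecutive chunks share 10 characters.

```ruby
# Hypothetical sketch of a size-and-overlap chunker (not the MR's Chunkers::BySize).
class BySizeSketch
  # Assumption: the content to chunk is handed over through this accessor.
  attr_accessor :content

  def initialize(chunk_size:, overlap:)
    unless overlap >= 0 && overlap < chunk_size
      raise ArgumentError, "overlap must be >= 0 and smaller than chunk_size"
    end

    @chunk_size = chunk_size
    @overlap = overlap
  end

  # Returns an array of strings, each at most chunk_size characters long.
  # Consecutive chunks share `overlap` characters.
  def chunks
    text = content.to_s
    return [] if text.empty?

    step = @chunk_size - @overlap
    result = []
    start = 0

    while start < text.length
      result << text[start, @chunk_size]
      # Stop once the current chunk reaches the end of the text.
      break if start + @chunk_size >= text.length

      start += step
    end

    result
  end
end

chunker = BySizeSketch.new(chunk_size: 4, overlap: 1)
chunker.content = "abcdefghij"
chunker.chunks # => ["abcd", "defg", "ghij"]
```

A chunker with different needs (for example one that splits on syntax nodes) would only have to keep the same chunks contract.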

A chunker should implement a chunks method that assumes ref.content is available and returns an array of chunks. The chunk preprocessor then assigns these chunks to the specified field, one document per chunk.
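The flow described above could look roughly like the sketch below. Everything in it is a stand-in: SketchRef, ParagraphChunker and ChunkPreprocessorSketch are hypothetical names, and passing the content to the chunker through a content accessor is an assumption, not necessarily how the MR wires it up.

```ruby
# Hypothetical stand-in for a reference; the real ActiveContext::Reference does much more.
SketchRef = Struct.new(:title, :description, :documents, keyword_init: true) do
  def title_and_description
    "Title: #{title}\n\nDescription: #{description}"
  end
end

# Hypothetical chunker that splits on blank lines.
# Any object responding to #content= and #chunks would work in this sketch.
class ParagraphChunker
  attr_accessor :content

  def chunks
    content.to_s.split(/\n{2,}/).reject(&:empty?)
  end
end

module ChunkPreprocessorSketch
  # For each ref: resolve the content via chunk_on, chunk it,
  # and store one document per chunk under the given field.
  def self.chunk(refs:, chunker:, chunk_on:, field:)
    refs.each do |ref|
      chunker.content = ref.public_send(chunk_on)
      ref.documents = chunker.chunks.map { |chunk| { field => chunk } }
    end
  end
end

ref = SketchRef.new(title: "Fix bug", description: "Handle nil input")
ChunkPreprocessorSketch.chunk(
  refs: [ref],
  chunker: ParagraphChunker.new,
  chunk_on: :title_and_description,
  field: :content_field
)
ref.documents
# => [{:content_field=>"Title: Fix bug"}, {:content_field=>"Description: Handle nil input"}]
```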

Other changes:

  • Update the jsons method to use documents if they are set, otherwise fall back to as_indexed_json / as_indexed_jsons.
  • Allow setting shared attributes: key-value pairs that should be present on every document.
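The two changes above can be pictured with the following sketch. JsonsSketchRef is a hypothetical class, and the exact merge order, as well as whether the fallback path also receives shared attributes and ref metadata, are assumptions rather than the behaviour implemented in this MR.

```ruby
# Hypothetical reference illustrating the documents-or-fallback logic.
class JsonsSketchRef
  attr_accessor :documents

  def initialize(id:, version:, shared: {})
    @id = id
    @version = version
    @shared = shared
    @documents = []
  end

  # Key-value pairs that should be present on every document.
  def shared_attributes
    @shared
  end

  # Fallback body used when no documents have been set.
  def as_indexed_json
    { title: "fallback" }
  end

  # Uses documents when present, otherwise the single fallback body.
  # Every resulting hash gets the shared attributes and the ref metadata.
  def jsons
    bodies = documents.empty? ? [as_indexed_json] : documents

    bodies.map do |body|
      shared_attributes.merge(body).merge(ref_id: @id, ref_version: @version)
    end
  end
end

ref = JsonsSketchRef.new(id: "10", version: 1, shared: { iid: 5 })
ref.jsons
# => [{:iid=>5, :title=>"fallback", :ref_id=>"10", :ref_version=>1}]

ref.documents = [{ content_field: "first chunk" }, { content_field: "second chunk" }]
ref.jsons.size # => 2
```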

The chunk preprocessor can be called as:

class MergeRequest < ::ActiveContext::Reference
  CONTENT_METHOD = :title_and_description
  CHUNK_FIELD = :content_field
  CHUNK_SIZE = 100
  OVERLAP = 10

  add_preprocessor :preload do |refs|
    preload(refs)
  end

  add_preprocessor :chunking do |refs|
    chunker = Chunkers::BySize.new(chunk_size: CHUNK_SIZE, overlap: OVERLAP)
    chunk(refs: refs, chunker: chunker, chunk_on: CONTENT_METHOD, field: CHUNK_FIELD)
  end

  def title_and_description
    "Title: #{database_record.title}\n\nDescription: #{database_record.description}"
  end

  def shared_attributes
    {
      iid: database_record.iid,
      namespace_id: database_record.project.id,
      traversal_ids: database_record.project.elastic_namespace_ancestry
    }
  end
  ...

This generates one document per chunk, each containing the chunk content under the specified CHUNK_FIELD together with the shared attributes:

ActiveContext::Reference.preprocess_references([ref]).first.jsons
=> [{:iid=>5, :namespace_id=>2, :traversal_ids=>"24-p2-", :content_field=>"Title: Optimize database queries in user controller\n\nDescription: Reduced N+1 queries in the user co", :ref_id=>"10", :ref_version=>1743163685},
 {:iid=>5, :namespace_id=>2, :traversal_ids=>"24-p2-", :content_field=>"he user controller by adding eager loading for user associations. Improves performance by approximat", :ref_id=>"10", :ref_version=>1743163685},
 {:iid=>5, :namespace_id=>2, :traversal_ids=>"24-p2-", :content_field=>"approximately 35% on user profile pages.", :ref_id=>"10", :ref_version=>1743163685}]

Each of these jsons will be indexed.

A follow-up MR handles generating embeddings per document, because at the moment they are generated at the reference level.

How to set up and validate locally

  1. Create the necessary files as per #524341 (closed)
  2. Add the chunk preprocessor as shown
  3. Run preprocessing for a ref:
     ActiveContext::Reference.preprocess_references([ref]).first.jsons

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #523414 (closed)

Edited by Madelein van Niekerk
