Search references

Madelein van Niekerk requested to merge 457724-any-reference into master

What does this MR do and why?

Updates the Elasticsearch indexing framework to allow more flexible indexing using references. It stays compatible with the existing framework (now called Legacy references) while introducing a way to add new document types. Existing document types will eventually be moved over to the new framework.

The process is roughly as follows:

  • A reference is added to a queue
  • An async process reads 1000 references from the queue and uses the information from the reference to perform indexing or deletion in Elasticsearch.
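The two steps above can be sketched in plain Ruby. This is only an illustration under assumptions: `InMemoryQueue`, `Ref`, and `process_batch` are hypothetical names, and a Hash stands in for Elasticsearch. Only the batch size of 1000 and the index/delete behaviour come from the description.

```ruby
# Hypothetical stand-ins: the real queue and Elasticsearch client are not shown here.
BATCH_SIZE = 1000 # batch size mentioned in the description

# A minimal reference: what to act on, which operation, and the document body.
Ref = Struct.new(:identifier, :operation, :payload)

class InMemoryQueue
  def initialize
    @items = []
  end

  # Step 1: a reference is added to the queue.
  def push(*refs)
    @items.concat(refs)
  end

  # Remove and return up to `limit` references from the front of the queue.
  def shift_batch(limit)
    @items.shift(limit)
  end

  def size
    @items.size
  end
end

# Step 2: read one batch and index or delete each reference.
# `index` is a plain Hash standing in for an Elasticsearch index.
def process_batch(queue, index)
  batch = queue.shift_batch(BATCH_SIZE)
  batch.each do |ref|
    case ref.operation
    when :index, :upsert then index[ref.identifier] = ref.payload
    when :delete then index.delete(ref.identifier)
    end
  end
  batch.size
end

queue = InMemoryQueue.new
index = { 7 => { title: "stale" } }
queue.push(Ref.new(31, :index, { title: "vuln" }), Ref.new(7, :delete, nil))
processed = process_batch(queue, index)
```

Keeping the queue entry small (an identifier plus an operation) is what lets the async process decide at processing time whether to index or delete.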

To add a reference to the queue, we serialize it; to process a reference from the queue, we deserialize it.

The main objective is to keep the current DocumentReferences working as-is (now referred to as Legacy references) and to make newly introduced references easy to customize. I'm using Vulnerabilities as an example in the description, but it hasn't been implemented yet.

Serializing

Using the following as examples:

  1. A new reference type for Vulnerabilities
  2. A legacy reference type for Issues
  3. A legacy reference type defined as a DocumentReference

We will add them to the queue by calling ::Elastic::ProcessBookkeepingService.track!:

[Screenshot: Screenshot_2024-04-29_at_17.04.03]

"tracked_items_encoded":"[[1,\"Issue 23 23 project_676\"],[2,\"Vulnerability|31|project_2\"],[3,\"User 1 1\"]]"

Deserializing

Now a cron worker fetches specs from the queue and deserializes them into instances of References:

[Screenshot: Screenshot_2024-04-29_at_17.04.11]

The refs we are left with after deserialization are:

[
  #<Gitlab::Elastic::DocumentReference:0x0000000336834850 @klass=Issue..., @db_id="23", @es_id="23", @es_parent="project_676">,
  #<Search::Elastic::References::Vulnerability:0x000000033b8b4d20 @klass="Vulnerability", @identifier=31, @routing="project_2">,
  #<Gitlab::Elastic::DocumentReference:0x00000003381d46c0 @klass=User..., @db_id="1", @es_id="1", @es_parent=nil>
]
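As a rough sketch of the round trip, the serialized strings shown earlier can be turned back into objects and serialized again. `LegacyRef`, `VulnerabilityRef`, and `deserialize` are hypothetical names, and dispatching on the `|` delimiter is an assumption inferred from the two formats in the log line; the MR may select the reference class differently.

```ruby
# Hypothetical sketch: class names and the dispatch rule are assumptions,
# inferred from the two serialized formats shown in this description.
DELIMITER = "|"

# Legacy format: space-separated "Klass db_id es_id [es_parent]".
LegacyRef = Struct.new(:klass, :db_id, :es_id, :es_parent) do
  def serialize
    [klass, db_id, es_id, es_parent].compact.join(" ")
  end
end

# New format: pipe-delimited "Vulnerability|identifier|routing".
VulnerabilityRef = Struct.new(:identifier, :routing) do
  def serialize
    ["Vulnerability", identifier, routing].join(DELIMITER)
  end
end

def deserialize(string)
  if string.include?(DELIMITER)
    _klass, identifier, routing = string.split(DELIMITER)
    VulnerabilityRef.new(Integer(identifier), routing)
  else
    klass, db_id, es_id, es_parent = string.split(" ")
    LegacyRef.new(klass, db_id, es_id, es_parent)
  end
end

specs = ["Issue 23 23 project_676", "Vulnerability|31|project_2", "User 1 1"]
refs = specs.map { |spec| deserialize(spec) }
round_trip = refs.map(&:serialize)
```

In this sketch the legacy references keep their ids as strings and leave `es_parent` as nil when it is absent, which matches the objects listed above.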

Every reference subclass then implements the following methods which are used by ProcessBookkeepingService and BulkIndexer:

  • required class methods:
    • preload_indexing_data(refs): preloads the ActiveRecord data needed for indexing, to avoid N+1 queries
    • index_name
  • required instance methods:
    • identifier
    • operation (:index, :upsert, :delete)
    • as_indexed_json
  • optional instance methods (default to nil):
    • routing
    • database_record
    • database_id
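A minimal sketch of this interface might look like the following. The method names come from the list above; `SketchReference`, `SketchVulnerability`, the index name, and the document body are hypothetical and are not the classes added in this MR.

```ruby
# Hypothetical base class: only the method names come from the list above.
class SketchReference
  # Required class methods.
  def self.preload_indexing_data(refs)
    raise NotImplementedError
  end

  def self.index_name
    raise NotImplementedError
  end

  # Required instance methods.
  def identifier
    raise NotImplementedError
  end

  def operation
    raise NotImplementedError
  end

  def as_indexed_json
    raise NotImplementedError
  end

  # Optional instance methods: empty bodies return nil by default.
  def routing; end
  def database_record; end
  def database_id; end
end

class SketchVulnerability < SketchReference
  attr_reader :identifier, :routing

  def self.preload_indexing_data(refs)
    refs # nothing to preload in this sketch
  end

  def self.index_name
    "sketch-vulnerabilities" # assumed index name
  end

  def initialize(identifier, routing)
    @identifier = identifier
    @routing = routing
  end

  def operation
    :upsert
  end

  def as_indexed_json
    { id: identifier, routing: routing }
  end
end

ref = SketchVulnerability.new(31, "project_2")
```

A subclass without a backing database record can simply leave `database_record` and `database_id` at their nil defaults.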

After ProcessBookkeepingService.new.execute finishes, the items are indexed (or deleted) and we see the following logs:

"message":"indexing_done","reference_class":"Issue","database_id":"23","es_id":"23","routing":"project_676"
"message":"indexing_done","reference_class":"Vulnerability","database_id":31,"es_id":31,"routing":"project_2"
"message":"indexing_done","reference_class":"User","database_id":"1","es_id":"1","routing":null

The file structure is as follows:

ee/lib/search/elastic/
├── reference.rb
├── concerns
│   ├── database_reference.rb
│   └── reference_utils.rb
└── references
    ├── legacy.rb
    └── vulnerability.rb

ReferenceUtils provides shared helper methods, such as delimiting. DatabaseReference can be included in Reference subclasses that have corresponding ActiveRecord objects.

ee/spec/elastic_integration/ingestion_pipeline_spec.rb has integration tests to make sure that any references currently in the queue continue to function.

Releasing this change

  1. All current references point to Legacy reference
  2. Add reference for Embeddings
  3. Add reference for WorkItems
  4. Add reference for Vulnerabilities
  5. Add reference for AI Agents
  6. [Later] Move existing references over to new framework (issues, users, etc.)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

  1. Track some items: ::Elastic::ProcessBookkeepingService.track!(Issue.first, Gitlab::Elastic::DocumentReference.new(User, 1, 1))
  2. Process the queue: Elastic::ProcessBookkeepingService.new.execute
  3. [optional] bundle exec rake gitlab:elastic:index to index everything from scratch

Related to #61870

Edited by Madelein van Niekerk
