Change BulkIndexer from index to update operation

Madelein van Niekerk requested to merge 442197-any-document-references into master

What does this MR do and why?

Updates the Elastic bulk indexer to use update with doc_as_upsert instead of index. The difference between the two is:

  • index: the index API is used to create a new document or replace an existing document entirely. When you use the index API, you provide the full document, and if a document with the same ID already exists, it is overwritten with the new content. This means that if you index a document without a specific field, the existing value of that field is lost because the document is replaced in its entirety.
  • update with doc_as_upsert: the update API is used to partially update an existing document. It allows you to make changes to specific fields without affecting the rest of the document. When you use the update API with the doc_as_upsert option, Elasticsearch will update the document if it exists or create a new one if it doesn’t. The key difference here is that only the fields you specify in the update will be changed; all other fields will remain intact.
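The difference can be illustrated with a toy in-memory model. This is a minimal sketch, not the Elasticsearch client API: `index_document`, `upsert_document` and the `store` hash are hypothetical names, and the field values are placeholders. It also only merges top-level keys, whereas Elasticsearch merges nested objects recursively.

```ruby
# Toy in-memory model of the two write modes (not the Elasticsearch client API).
# `store` maps document IDs to documents, both plain Hashes.

# index: the stored document is replaced in its entirety.
def index_document(store, id, doc)
  store[id] = doc.dup
end

# update with doc_as_upsert: merge `doc` into the existing document,
# or create the document from `doc` when nothing is stored under `id` yet.
def upsert_document(store, id, doc)
  store[id] = (store[id] || {}).merge(doc)
end

store = {}
index_document(store, "327", { "title" => "first", "embeddings" => [0.1, 0.2] })
index_document(store, "327", { "title" => "second" })
indexed = store["327"]   # the "embeddings" field was dropped by the second write

store = {}
upsert_document(store, "327", { "title" => "first", "embeddings" => [0.1, 0.2] })
upsert_document(store, "327", { "title" => "second" })
upserted = store["327"]  # the "embeddings" field survived the second write
```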

This lets us specify which fields to index instead of sending every field on every request. It is particularly useful when we want to keep a field unchanged while updating other fields, e.g. embeddings:

  1. When the document doesn't exist: the first update creates the document with as_indexed_json
  2. Adding embeddings: update the document with embeddings: [...] - this will only add the new field and keep everything else as is
  3. When the database record is updated, don't regenerate embeddings but only update other fields.
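As a rough sketch, the three steps above could translate into the following update request bodies. `upsert_body` is a hypothetical helper, and the titles and embedding values are placeholders; in the real indexer the first body would come from as_indexed_json.

```ruby
require "json"

# Hypothetical helper: the request body of one bulk `update` action.
def upsert_body(fields)
  JSON.generate({ doc: fields, doc_as_upsert: true })
end

# Placeholder values throughout; only the shape of each body matters here.
steps = [
  upsert_body({ id: 327, iid: 3, title: "a title" }), # 1. creates the document
  upsert_body({ embeddings: [0.1, 0.2, 0.3] }),       # 2. sends only the embeddings field
  upsert_body({ title: "another title" })             # 3. sends only the changed field
]

puts steps
```

Steps 2 and 3 each send a single field, so the embeddings written in step 2 are never resent or regenerated in step 3.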

This allows us to decide selectively when to update specific fields in Elasticsearch, especially fields that are expensive to compute.

The change is behind a feature flag ([Feature flag] Rollout of `elastic_bulk_indexer... (#452332 - closed)). The flag is used without actors, since the bulk indexer is actor-agnostic.

The MR also refactors the bulk indexer by moving some DocumentReference-specific methods into DocumentReference, e.g. index_operation and delete_operation. This lets us override these methods if needed and keeps the indexer lightweight.

With feature disabled

{"index":{"_index":"gitlab-development-issues","_type":null,"_id":"327","routing":"project_2"}}
{"id":327,"iid":3,...}

With feature enabled

{"update":{"_index":"gitlab-development-issues","_type":null,"_id":"327","routing":"project_2"}}
{"doc":{"id":327,"iid":3,...},"doc_as_upsert":true}
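The two payload shapes above can be reproduced with a small builder. This is a sketch under assumptions, not the code in this MR: `bulk_lines` and its keyword arguments are hypothetical, and `use_upsert` stands in for the feature flag check.

```ruby
require "json"

# Hypothetical builder for the two NDJSON lines of one bulk action.
# `use_upsert` stands in for the feature flag check.
def bulk_lines(index:, id:, routing:, doc:, use_upsert:)
  meta = { _index: index, _type: nil, _id: id.to_s, routing: routing }

  if use_upsert
    # update: the document is wrapped in "doc" and created if it is missing
    [JSON.generate({ update: meta }), JSON.generate({ doc: doc, doc_as_upsert: true })]
  else
    # index: the document is sent as-is and replaces any existing one
    [JSON.generate({ index: meta }), JSON.generate(doc)]
  end
end

args = { index: "gitlab-development-issues", id: 327, routing: "project_2", doc: { id: 327, iid: 3 } }
disabled = bulk_lines(**args, use_upsert: false)
enabled  = bulk_lines(**args, use_upsert: true)

puts disabled, enabled
```

Only the action name and the wrapping of the document differ between the two modes; the metadata line is otherwise identical.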

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

  1. Disable the flag: Feature.disable(:elastic_bulk_indexer_use_upsert)
  2. (optional) Create indices from scratch: rake gitlab:elastic:index
  3. Find a document in any index and save the current state of the document, e.g. gitlab-development-issues/_doc/326
  4. Update the document, e.g. Issue.find(326).update(title: "another title")
  5. Process the queue: Elastic::ProcessBookkeepingService.new.execute
  6. Check that the document was updated and that it is searchable
  7. Enable the flag: Feature.enable(:elastic_bulk_indexer_use_upsert)
  8. Repeat steps 2-6 with the flag enabled

Related to #442197 (closed)

Edited by Madelein van Niekerk
