Create abstraction layer to support Elasticsearch and OpenSearch

Since both OpenSearch and Elasticsearch will be supported for now, we want to create an abstraction layer that selects mappings, search code, and other platform-specific code based on whether ES or OS is in use.

Solution validation

Where do the code paths diverge between Elasticsearch and OpenSearch, and between different versions of ES/OS?

Index creation
- Specific mappings and/or settings in *Config class or Types:: class
Updating index
- Changing mapping: requires an ES migration, which can be skipped and can define different mappings per platform
Indexing documents (calling .track!)
- as_indexed_json could be different
- Sometimes track! should not be called if the index doesn't support a ref type
Searching
- Search query could be different
Administration
- Advanced Search admin page has different cluster connection options for OS vs. ES

So basically we have a few places that are likely to diverge:

- Mappings/settings during index creation
- Mapping updates in migrations
- as_indexed_json
- Search queries

And then there might be places in code where we need checks for the platform used.

What else needs to be done in order to upgrade/remove the ES gems?

How do we determine which path to serve?

The helper class has some methods for detecting the platform in use. For vectors we use Gitlab::Elastic::Helper.default.vectors_supported?(:elasticsearch), which checks info[:distribution] == 'elasticsearch' && info[:version].to_f >= 8. Alternatively, we could use CurrentSettings.
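A minimal sketch of that check as a stand-alone object, assuming the cluster info exposes a distribution name and a version string as in the expression above. The `PlatformInfo` struct and its method names are hypothetical and only illustrate the shape of the predicate, not the actual helper.

```ruby
# Hypothetical stand-alone version of the platform check quoted above.
# The two members mirror info[:distribution] and info[:version].
PlatformInfo = Struct.new(:distribution, :version) do
  # True when the cluster runs `distribution` at `min_major` or newer.
  def at_least?(distribution, min_major)
    self.distribution == distribution.to_s && version.to_f >= min_major
  end

  # Same rule as the expression above: Elasticsearch 8 or newer.
  def elasticsearch_vectors_supported?
    at_least?(:elasticsearch, 8)
  end
end

es8 = PlatformInfo.new('elasticsearch', '8.11.4')
es7 = PlatformInfo.new('elasticsearch', '7.17.0')
os2 = PlatformInfo.new('opensearch', '2.11.0')
```

Note that `to_f` reads the version string as a decimal ("8.11.4" becomes 8.11), which is enough for a major-version threshold but would not order minor versions such as 8.9 and 8.10 correctly.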

How do we test on different versions and platforms?

QA tests. We run QA tests on different versions of OS and ES.

We also need to think about blobs/wikis. The JSON data is determined by the indexer, so the indexer would also have diverging paths. We can pass extra options to the run command.

Implementation: inline if-else

The easiest approach would be to add a few methods to the ES helper, similar to vectors_supported? (cached for performance), and call them wherever the paths diverge.
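The caching part could look roughly like the sketch below. `PlatformHelper` and its `fetch_info` callable are hypothetical stand-ins for the real helper and its cluster request, and only the Elasticsearch threshold mentioned in this issue is encoded.

```ruby
# Hypothetical helper: wraps a cluster-info lookup and memoizes the result so
# repeated platform checks do not trigger repeated requests.
class PlatformHelper
  # `fetch_info` is any callable returning { distribution:, version: }.
  def initialize(fetch_info)
    @fetch_info = fetch_info
  end

  # Fetched once per helper instance, then reused.
  def info
    @info ||= @fetch_info.call
  end

  def vectors_supported?(distribution)
    case distribution
    when :elasticsearch
      info[:distribution] == 'elasticsearch' && info[:version].to_f >= 8
    else
      false # thresholds for other platforms are not specified in this sketch
    end
  end
end

calls = 0
helper = PlatformHelper.new(-> { calls += 1; { distribution: 'elasticsearch', version: '8.12.0' } })

# Inline divergence check at the call site.
embedding_type = helper.vectors_supported?(:elasticsearch) ? 'dense_vector' : nil
helper.vectors_supported?(:elasticsearch) # second call reuses the cached info
```

Instance-level memoization is the simplest option. A longer-lived cache would also need some form of expiry so that a cluster upgrade is eventually picked up.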

Example index mapping:

def self.mappings
  properties = {
    type: { type: 'keyword' },
    id: { type: 'integer' },
    # ... (other fields elided)
  }

  if helper.quantized_vectors_supported?(:elasticsearch)
    properties[:embedding] = {
      type: 'dense_vector',
      dims: 768,
      similarity: 'cosine',
      index: true,
      index_options: {
        type: 'int8_hnsw'
      }
    }
  elsif helper.vectors_supported?(:elasticsearch)
    properties[:embedding] = {
      type: 'dense_vector',
      dims: 768,
      similarity: 'cosine',
      index: true
    }
  elsif helper.vectors_supported?(:opensearch)
    properties[:embedding] = {
      type: 'knn_vector',
      dimension: 768,
      method: {
        name: 'hnsw'
      }
    }
  end

  {
    dynamic: 'strict',
    properties: properties
  }
end

Example mapping migration:

class AddEmbeddingToIssues < Elastic::Migration
  include Elastic::MigrationUpdateMappingsHelper

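  # Skip when neither platform supports vectors; vectors_supported? is called
  # with a platform argument everywhere else in these examples.
  skip_if -> {
    helper = Gitlab::Elastic::Helper.default
    !helper.vectors_supported?(:elasticsearch) && !helper.vectors_supported?(:opensearch)
  }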

  DOCUMENT_TYPE = Issue

  private

  def new_mappings
    if helper.quantized_vectors_supported?(:elasticsearch)
      {
        embedding_2: {
          type: 'dense_vector',
          dims: 768,
          similarity: 'cosine',
          index: true,
          index_options: {
            type: 'int8_hnsw'
          }
        }
      }
    elsif helper.vectors_supported?(:elasticsearch)
      {
        embedding_0: {
          type: 'dense_vector',
          dims: 768,
          similarity: 'cosine',
          index: true
        }
      }
    else
      {
        embedding_1: {
          type: 'knn_vector',
          dimension: 768,
          method: {
            name: 'hnsw'
          }
        }
      }
    end
  end
end

Note that each model/dimension/vector type combination has its own field name, in accordance with #471983 (closed).

Example `as_indexed_json`:

def as_indexed_json
  data = {
    routing: routing
  }

  if helper.quantized_vectors_supported?(:elasticsearch)
    data["embedding_#{EmbeddingVersion.active.for_type(:elasticsearch, :quantized).id}"] = embedding
  elsif helper.vectors_supported?(:elasticsearch)
    data["embedding_#{EmbeddingVersion.active.for_type(:elasticsearch).id}"] = embedding
  elsif helper.vectors_supported?(:opensearch)
    data["embedding_#{EmbeddingVersion.active.for_type(:opensearch).id}"] = embedding
  end

  data
end

Con: we need to keep supporting older versions, so the if statement will keep growing until we drop support for a version.

Also create an Architecture Design Document.

Edited Aug 16, 2024 by Madelein van Niekerk