OpenSearch 3.0 compatibility
What does this MR do and why?
Change the HNSW engine from `nmslib` to `lucene` for OpenSearch versions after 2.1.0. This restores compatibility with OpenSearch 3.0, which was unsupported because `nmslib` went from being the default engine to being deprecated.
OpenSearch version | lucene supported? | nmslib supported? |
---|---|---|
1.x - 2.1.0 | No | Yes |
2.2 - 3 | Yes | Yes |
3.x | Yes | No |
- Set engine to `nmslib` for versions <= 2.1.0
- Set engine to `lucene` for versions > 2.1.0
- Reindex when version is > 2.1.0 so that the engine changes to `lucene`
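The version rule above could be sketched roughly as follows. This is only an illustration, not the code in this MR: the constant and both helper names are hypothetical, and the `parameters` values are copied from the mapping shown in the validation steps.

```ruby
# Hypothetical sketch of the engine selection rule; names are illustrative only.
# Gem::Version ships with Ruby and compares dotted version strings correctly.
LAST_VERSION_WITHOUT_LUCENE = Gem::Version.new('2.1.0')

# Returns 'nmslib' for versions <= 2.1.0 and 'lucene' for anything newer.
def hnsw_engine_for(version_string)
  version = Gem::Version.new(version_string)
  version > LAST_VERSION_WITHOUT_LUCENE ? 'lucene' : 'nmslib'
end

# Builds the "method" part of a knn_vector mapping for the given version.
def knn_method_for(version_string)
  {
    'engine' => hnsw_engine_for(version_string),
    'space_type' => 'cosinesimil',
    'name' => 'hnsw',
    'parameters' => { 'ef_construction' => 100, 'm' => 16 }
  }
end

puts hnsw_engine_for('1.3.0')
puts hnsw_engine_for('2.1.0')
puts hnsw_engine_for('3.0.0')
```

Comparing with `Gem::Version` rather than plain strings avoids mistakes such as `'2.10.0' < '2.2.0'`.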
We choose `lucene` as the engine over `faiss` because `faiss` does not support filtering while searching, while `lucene` does. This is needed for the hybrid search feature: we need to apply filters during search, not as post-filtering.
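To illustrate what filtering during search means, here is a minimal sketch of a k-NN query body with the filter placed inside the `knn` clause, so it is applied while the nearest neighbours are collected rather than afterwards. The helper name and the `project_id` filter field are assumptions for illustration, and the query that hybrid search actually builds may look different.

```ruby
# Hypothetical helper; not taken from this MR.
# The filter sits inside the knn clause, so it is applied during the search.
def filtered_knn_query(vector:, k:, filter:)
  {
    size: k,
    query: {
      knn: {
        embedding_1: {
          vector: vector,
          k: k,
          filter: filter
        }
      }
    }
  }
end

# project_id is an assumed field name, used only for this example.
query = filtered_knn_query(
  vector: Array.new(768, 0.0), # 768 matches the dimension of embedding_1
  k: 5,
  filter: { term: { project_id: 1 } }
)

puts query.dig(:query, :knn, :embedding_1, :filter).inspect
```

With post-filtering, the top `k` results are selected first and then filtered, which can return fewer than `k` hits.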
This MR also introduces a reindex task for OpenSearch instances to make the switch to the `lucene` engine.
I tested the indexing + searching flow and that hybrid search continues to work when the new engine is used.
Note
Embedding tracking is currently only enabled on .com, so no OpenSearch customers have existing embeddings. Even if they did, the reindex would move the embeddings to the new engine.
References
Latest OpenSearch fails during index creation d... (#540086 - closed)
How to set up and validate locally
Important
These steps are destructive: they delete data in the indices.
- Run OpenSearch 3
- Checkout master
- Recreate all indices:
Search::RakeTaskExecutorService.new(logger: ::Gitlab::Elasticsearch::Logger.build).execute(:recreate_index)
- Note that index creation fails with `nmslib engine is deprecated in OpenSearch`
- Checkout `540086-change-hnsw-engine`
- Recreate all indices:
Search::RakeTaskExecutorService.new(logger: ::Gitlab::Elasticsearch::Logger.build).execute(:recreate_index)
- Note that the engine in the mapping for `embedding_1` is `lucene`: `"embedding_1"=>{"type"=>"knn_vector", "dimension"=>768, "method"=>{"engine"=>"lucene", "space_type"=>"cosinesimil", "name"=>"hnsw", "parameters"=>{"ef_construction"=>100, "m"=>16}}}`
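If you prefer to check the engine programmatically instead of reading the mapping by eye, something like the following could work. It is a minimal sketch that starts from the mapping hash shown above. How you fetch that hash from the cluster is left out, and the `hnsw_engine` helper is hypothetical.

```ruby
# The mapping hash as printed in the step above.
mappings = {
  'embedding_1' => {
    'type' => 'knn_vector',
    'dimension' => 768,
    'method' => {
      'engine' => 'lucene',
      'space_type' => 'cosinesimil',
      'name' => 'hnsw',
      'parameters' => { 'ef_construction' => 100, 'm' => 16 }
    }
  }
}

# Hypothetical helper: returns the engine of a knn_vector field,
# or nil when the field or its method definition is missing.
def hnsw_engine(mappings, field)
  mappings.dig(field, 'method', 'engine')
end

puts hnsw_engine(mappings, 'embedding_1')
```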
[Optional] Do the same for OpenSearch 1 and 2.
- Checkout master
- Recreate all indices:
Search::RakeTaskExecutorService.new(logger: ::Gitlab::Elasticsearch::Logger.build).execute(:recreate_index)
- Checkout `540086-change-hnsw-engine`
- Execute the migration worker and the reindex worker on repeat: `Elastic::MigrationWorker.new.perform` and `ElasticClusterReindexingCronWorker.new.perform`
- Note that the engine in the mapping for `embedding_1` is `lucene`: `"embedding_1"=>{"type"=>"knn_vector", "dimension"=>768, "method"=>{"engine"=>"lucene", "space_type"=>"cosinesimil", "name"=>"hnsw", "parameters"=>{"ef_construction"=>100, "m"=>16}}}`
[Optional] Track an embedding
::Search::Elastic::ProcessEmbeddingBookkeepingService.track_embedding!(WorkItem.first)
::Search::Elastic::ProcessEmbeddingBookkeepingService.new.execute
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #540086 (closed)