OpenSearch 3.0 compatibility
What does this MR do and why?
Changes the HNSW engine from nmslib to lucene for OpenSearch versions after 2.1.0. This restores compatibility with OpenSearch 3.0, where index creation currently fails because nmslib, previously the default engine, is now deprecated.
| OpenSearch version | lucene supported? | nmslib supported? |
|---|---|---|
| 1.x - 2.1.0 | No | Yes |
| 2.2 - 3 | Yes | Yes |
| 3.x | Yes | No |
- Set engine to `nmslib` for versions <= 2.1.0
- Set engine to `lucene` for versions > 2.1.0
- Reindex when version is > 2.1.0 so that the engine changes to `lucene`
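The version rule above can be sketched in plain Ruby. This is an illustration only, not the code in this MR: `NMSLIB_MAX_VERSION`, `hnsw_engine_for`, and `reindex_needed?` are hypothetical names, and the comparison relies on `Gem::Version` from RubyGems.

```ruby
require 'rubygems'

# Hypothetical constant: the last OpenSearch version that keeps nmslib.
NMSLIB_MAX_VERSION = Gem::Version.new('2.1.0')

# Hypothetical helper: picks the HNSW engine for an OpenSearch version string.
def hnsw_engine_for(version)
  Gem::Version.new(version) <= NMSLIB_MAX_VERSION ? 'nmslib' : 'lucene'
end

# Hypothetical helper: a reindex is only needed where the engine becomes lucene.
def reindex_needed?(version)
  hnsw_engine_for(version) == 'lucene'
end
```

Comparing parsed versions rather than raw strings avoids ordering mistakes such as `'2.10.0' < '2.2.0'`.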
We chose lucene over faiss because faiss does not support filtering while searching, whereas lucene does. Hybrid search needs this: filters must be applied during the search rather than as a post-filtering step.
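To illustrate filtering during the search, here is a sketch of a k-NN query body with the filter placed inside the `knn` clause, which is the shape the OpenSearch k-NN documentation describes for the lucene engine. `knn_query_with_filter` and the `project_id` filter field are hypothetical and not taken from this MR; `embedding_1` is the vector field shown in the mappings in the validation steps.

```ruby
# Hypothetical helper: builds a k-NN query body where the filter is part of
# the knn clause, so it is applied during the vector search itself.
def knn_query_with_filter(vector:, k:, filter:, field: 'embedding_1')
  {
    query: {
      knn: {
        field => {
          vector: vector,
          k: k,
          filter: filter
        }
      }
    }
  }
end

# Example usage with a shortened vector (the real field has 768 dimensions)
# and a hypothetical project_id filter.
body = knn_query_with_filter(
  vector: [0.1, 0.2, 0.3],
  k: 5,
  filter: { term: { project_id: 1 } }
)
```

With post-filtering, the nearest `k` neighbours would be fetched first and filtered afterwards, which can return fewer than `k` results.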
This MR also introduces a reindex task for OpenSearch instances to make the switch to the lucene engine.
I tested the indexing and searching flow and confirmed that hybrid search continues to work with the new engine.
Note
Embedding tracking is currently enabled only on .com, so no OpenSearch customers have existing embeddings. Even if they did, the reindex would move the embeddings to the new engine.
References
Latest OpenSearch fails during index creation d... (#540086 - closed)
How to set up and validate locally
Important
These steps are destructive: they delete data in indices.
- Run OpenSearch 3
- Check out `master`
- Recreate all indices: `Search::RakeTaskExecutorService.new(logger: ::Gitlab::Elasticsearch::Logger.build).execute(:recreate_index)`
- Note that index creation fails with `nmslib engine is deprecated in OpenSearch`
- Check out `540086-change-hnsw-engine`
- Recreate all indices: `Search::RakeTaskExecutorService.new(logger: ::Gitlab::Elasticsearch::Logger.build).execute(:recreate_index)`
- Note that the mapping for `embedding_1` uses lucene: `"embedding_1"=>{"type"=>"knn_vector", "dimension"=>768, "method"=>{"engine"=>"lucene", "space_type"=>"cosinesimil", "name"=>"hnsw", "parameters"=>{"ef_construction"=>100, "m"=>16}}}`
[Optional] Do the same for OpenSearch 1 and 2.
- Check out `master`
- Recreate all indices: `Search::RakeTaskExecutorService.new(logger: ::Gitlab::Elasticsearch::Logger.build).execute(:recreate_index)`
- Check out `540086-change-hnsw-engine`
- Execute the migration worker and reindex worker on repeat: `Elastic::MigrationWorker.new.perform` and `ElasticClusterReindexingCronWorker.new.perform`
- Note that the mapping for `embedding_1` uses lucene: `"embedding_1"=>{"type"=>"knn_vector", "dimension"=>768, "method"=>{"engine"=>"lucene", "space_type"=>"cosinesimil", "name"=>"hnsw", "parameters"=>{"ef_construction"=>100, "m"=>16}}}`
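The mapping check in the steps above can also be done programmatically. The sketch below assumes you already have the field mappings as a Ruby hash in the form printed above; `knn_engine` is a hypothetical helper, not part of the codebase.

```ruby
# Hypothetical helper: reads the k-NN engine configured for a vector field.
# Returns nil when the field or its method settings are missing.
def knn_engine(properties, field)
  properties.dig(field, 'method', 'engine')
end

# Mapping copied from the validation steps.
properties = {
  'embedding_1' => {
    'type' => 'knn_vector',
    'dimension' => 768,
    'method' => {
      'engine' => 'lucene',
      'space_type' => 'cosinesimil',
      'name' => 'hnsw',
      'parameters' => { 'ef_construction' => 100, 'm' => 16 }
    }
  }
}

puts knn_engine(properties, 'embedding_1') # prints "lucene"
```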
[Optional] Track an embedding
- `::Search::Elastic::ProcessEmbeddingBookkeepingService.track_embedding!(WorkItem.first)`
- `::Search::Elastic::ProcessEmbeddingBookkeepingService.new.execute`
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #540086 (closed)
