Index issue embeddings in Elasticsearch
Description
An embedding is a compressed numerical representation of a piece of text that captures its meaning and is generated by an ML model. We want to create and store embeddings for issues in Elasticsearch in order to unlock these opportunities:
- Enhance the Global Search issue search by using hybrid search, which combines semantic and exact matching.
- Find duplicate issues
- Find similar issues
Elasticsearch is used to store the vectors because it is a scalable platform and is already used by the Global Search feature.
#440424 (comment 1780797806) explores ways to tune for performance.
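Hybrid search in Elasticsearch 8.x can combine a lexical `match` query with a `knn` vector clause in a single search request. A minimal sketch of such a request body follows; the index and field names (`title`, `embedding`), the boosts, and the query vector are illustrative assumptions, not the actual GitLab schema:

```python
# Sketch of a hybrid search request body (Elasticsearch 8.x `_search` API).
# Field names, boosts, and the query vector are illustrative assumptions.
def hybrid_search_body(text, query_vector, k=10):
    return {
        "query": {  # lexical (exact-match) side
            "match": {"title": {"query": text, "boost": 0.5}}
        },
        "knn": {    # semantic side: approximate nearest-neighbour on vectors
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
            "boost": 0.5,
        },
        "size": k,
    }

body = hybrid_search_body("upgrade rails", [0.1, 0.2, 0.3])
```

Scores from both clauses are combined, so documents matching both lexically and semantically rank highest.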
Considerations
OpenSearch vs. Elasticsearch
Which model to use
Elasticsearch provides two ways of generating and storing embeddings:
- Host a model on ES (requires a license) and attach an ingestion pipeline to generate the embeddings when a document is indexed
- Generate embeddings outside of ES and pass the vectors along with an indexing operation
We can move from one approach to the other, but doing so requires effort.
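As a sketch of the second approach, a bulk indexing payload can carry a pre-computed vector alongside the document source (stored in a `dense_vector` field). The index name, field names, and the 3-dimensional vector below are assumptions for illustration:

```python
import json

# Sketch of option 2: the embedding is generated outside Elasticsearch
# (e.g. via a Vertex API call) and passed along with the bulk operation.
# Index name, field names, and vector dimensions are assumptions.
def bulk_index_lines(doc_id, title, embedding):
    action = {"index": {"_index": "issues", "_id": doc_id}}
    source = {"title": title, "embedding": embedding}
    # The _bulk API expects newline-delimited JSON: action line, then source line.
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"

payload = bulk_index_lines(1, "Upgrade Rails", [0.1, 0.2, 0.3])
```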
Current embeddings (for Duo Chat) are generated by the Vertex textembedding-gecko@3
model, which is accessed via an API. The model's rate limit of 1,500 requests per minute means generating embeddings for 1 million documents would take 12+ hours. Elasticsearch's guidance is that it can index 1,200 docs/s of documents containing embeddings on 8 vCPUs, 16 GB RAM, and 1x300 GiB SSD disk,
which equates to indexing 1 million docs in under an hour.
Elasticsearch also provides a sparse-encoding model called ELSER, which is hosted on Elasticsearch itself. Their guidance is that it can index 15 docs/s on 16 vCPUs, 32 GB RAM, and 2x360 GiB local NVMe disk,
which equates to generating and storing embeddings for 1 million docs in roughly 24 hours.
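The back-of-the-envelope numbers above can be checked directly. These are raw rates only; real-world overhead, retries, and batching push the totals toward the quoted figures:

```python
DOCS = 1_000_000

# Vertex API rate limit: 1,500 requests/minute (one doc per request assumed).
vertex_hours = DOCS / 1_500 / 60   # ~11.1 h raw; 12+ h once overhead/retries are included

# Indexing pre-computed vectors: 1,200 docs/s per Elastic's guidance.
index_minutes = DOCS / 1_200 / 60  # ~13.9 min, well under an hour

# ELSER generating embeddings at ingest time: 15 docs/s.
elser_hours = DOCS / 15 / 3600     # ~18.5 h raw; ~24 h with overhead

print(round(vertex_hours, 1), round(index_minutes, 1), round(elser_hours, 1))
```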
| | Model on Elastic | Generating embeddings outside of Elastic |
|---|---|---|
| Time to generate and index | ~24 hours per 1 million docs | ~16 hours per 1 million docs |
| Pros | Easier to implement | Easier to choose a different embedding model |
| Cons | Hardware costs more | Requires implementing rate-limiting functionality for generating embeddings in Rails |
| | Requires an Elastic license | AI Gateway doesn't have this endpoint yet |
| | Not possible when OpenSearch is used | Parity with OpenSearch is possible (only the mapping and search API differ) |
How to index data using the Advanced Search framework
Because generating embeddings is an expensive operation, we want to generate them only when necessary. With the current framework, we send a bulk request that creates a fresh document on every create/update, meaning the embedding would be regenerated on every index operation. There are two options to fix this:
- Change the bulk indexer to use `update` instead of `index` when the document already exists. Once this is available, we can update the embeddings conditionally (i.e. only when the title/description changed). This would require substantial work.
- Put embeddings in a separate index that contains only the ID, content (title & description), and embedding. This means we need pre-filtering to select which documents can be searched based on project, permission, status, etc., which can be done in the main issues index or in Postgres. This is an easier option but comes at the cost of an additional call. It also gives us the opportunity to put all embeddings in a single index, which makes searching across types easier. For example, searching for "anything that relates to upgrading rails" could return issues, MRs, epics, etc.
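A sketch of the first option: building a bulk `update` action with a partial `doc`, regenerating the embedding only when the embedded fields changed. All names here are hypothetical and `embed` stands in for a call to an external embedding model; this is not the actual indexer code:

```python
# Hypothetical sketch of conditional embedding updates. `embed` stands in
# for an external embedding model call (e.g. Vertex); field names assumed.
EMBEDDED_FIELDS = {"title", "description"}

def build_update(doc_id, changed_fields, doc, embed):
    partial = {f: doc[f] for f in changed_fields}
    if EMBEDDED_FIELDS & set(changed_fields):
        # Only pay the embedding cost when the embedded text actually changed.
        partial["embedding"] = embed(doc["title"] + " " + doc["description"])
    # Bulk `update` action line plus partial-document body.
    return {"update": {"_index": "issues", "_id": doc_id}}, {"doc": partial}

fake_embed = lambda text: [0.0, 0.0]
doc = {"title": "t", "description": "d", "state": "closed"}
action, body = build_update(1, ["state"], doc, fake_embed)  # no re-embedding
```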
Proposal
Given the above considerations, I suggest we generate embeddings outside of Elasticsearch (e.g. with Vertex) and store them in a separate index that contains all embeddings. This allows OpenSearch compatibility and makes switching to a different model easy, while not impacting the Global Search feature or requiring an Elastic license.
Because ELSER is easy to implement, we will finish the work to demonstrate the value of storing vectors in Elasticsearch by implementing hybrid search on staging using ELSER behind a feature flag.