Index issue embeddings in Elasticsearch
Description
An embedding is a compressed numerical representation of a piece of text that captures its meaning and is generated by an ML model. We want to create and store embeddings for issues in Elasticsearch in order to unlock these opportunities:
- Enhance the Global Search issue search by using hybrid search, which combines semantic and exact matching.
- Find duplicate issues
- Find similar issues
Elasticsearch is used to store the vectors because it is a scalable platform and is already used by the Global Search feature.
#440424 (comment 1780797806) explores ways to tune for performance.
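Hybrid search in Elasticsearch 8.x can combine a lexical `match` query with a `knn` vector clause in a single search request. A minimal sketch of such a request body follows; the index and field names (`title`, `embedding`), the boosts, and the query vector are illustrative assumptions, not the actual GitLab schema:

```python
# Sketch of a hybrid search request body (Elasticsearch 8.x `_search` API).
# Field names, boosts, and the query vector are illustrative assumptions.
def hybrid_search_body(text, query_vector, k=10):
    return {
        "query": {  # lexical (exact-match) side
            "match": {"title": {"query": text, "boost": 0.5}}
        },
        "knn": {    # semantic side: approximate nearest-neighbour on vectors
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
            "boost": 0.5,
        },
        "size": k,
    }

body = hybrid_search_body("upgrade rails", [0.1, 0.2, 0.3])
```

Scores from both clauses are combined, so documents matching both lexically and semantically rank highest.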
Considerations
OpenSearch vs. Elasticsearch
Which model to use
Elasticsearch provides two ways of generating and storing embeddings:
- Host a model on ES (requires a license) and attach an ingestion pipeline to generate the embeddings when a document is indexed
- Generate embeddings outside of ES and pass the vectors along with an indexing operation
We can move from one approach to the other, but doing so requires effort.
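As a sketch of the second approach, a bulk indexing payload can carry a pre-computed vector alongside the document source (stored in a `dense_vector` field). The index name, field names, and the 3-dimensional vector below are assumptions for illustration:

```python
import json

# Sketch of option 2: the embedding is generated outside Elasticsearch
# (e.g. via a Vertex API call) and passed along with the bulk operation.
# Index name, field names, and vector dimensions are assumptions.
def bulk_index_lines(doc_id, title, embedding):
    action = {"index": {"_index": "issues", "_id": doc_id}}
    source = {"title": title, "embedding": embedding}
    # The _bulk API expects newline-delimited JSON: action line, then source line.
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"

payload = bulk_index_lines(1, "Upgrade Rails", [0.1, 0.2, 0.3])
```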
Current embeddings (for Duo Chat) are generated by the Vertex textembedding-gecko@3
model, which is accessed via an API. The model's rate limit of 1,500 requests per minute means generating embeddings for 1 million documents would take 12+ hours. Elasticsearch's guidance is that it can index 1,200 docs/s of documents containing embeddings on 8 vCPUs, 16 GB RAM, and 1x300 GiB SSD disk,
which equates to indexing 1 million docs in under an hour.
Elasticsearch also provides a sparse-encoding model called ELSER, which is hosted on Elasticsearch itself. Their guidance is that it can index 15 docs/s on 16 vCPUs, 32 GB RAM, and 2x360 GiB local NVMe disk,
which equates to generating and storing embeddings for 1 million docs in roughly 24 hours.
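The back-of-the-envelope numbers above can be checked directly. These are raw rates only; real-world overhead, retries, and batching push the totals toward the quoted figures:

```python
DOCS = 1_000_000

# Vertex API rate limit: 1,500 requests/minute (one doc per request assumed).
vertex_hours = DOCS / 1_500 / 60   # ~11.1 h raw; 12+ h once overhead/retries are included

# Indexing pre-computed vectors: 1,200 docs/s per Elastic's guidance.
index_minutes = DOCS / 1_200 / 60  # ~13.9 min, well under an hour

# ELSER generating embeddings at ingest time: 15 docs/s.
elser_hours = DOCS / 15 / 3600     # ~18.5 h raw; ~24 h with overhead

print(round(vertex_hours, 1), round(index_minutes, 1), round(elser_hours, 1))
```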
| | Model on Elastic | Generating embeddings outside of Elastic |
|---|---|---|
| Time to generate and index | ~24 hours per 1 million docs | ~16 hours per 1 million docs |
| Pros | Easier to implement | Easier to choose a different embedding model |
| Cons | Hardware costs more | Requires implementing rate-limiting functionality for generating embeddings in Rails |
| | Requires an Elastic license | AI Gateway doesn't have this endpoint yet |
| | Not possible when OpenSearch is used | Parity with OpenSearch is possible (only the mapping and search API differ) |
How to index data using the Advanced Search framework
Because generating embeddings is an expensive operation, we want to generate them only when necessary. With the current framework, we send a bulk request that creates a fresh document on every create/update, meaning the embedding would be regenerated on every index operation. There are two options to fix this:
- Change the bulk indexer to use `update` instead of `index` when the document already exists. Once this is available, we can update the embeddings conditionally (i.e. only when the title/description changed). This would require substantial work.
- Put embeddings in a separate index that contains only the ID, content (title & description), and embedding. This means we need pre-filtering to select which documents can be searched based on project, permission, status, etc., which can be done in the main issues index or in Postgres. This is an easier option but comes at the cost of an additional call. It also gives us the opportunity to put all embeddings in a single index, which makes searching across types easier. For example, searching for "anything that relates to upgrading rails" could return issues, MRs, epics, etc.
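A sketch of the first option: building a bulk `update` action with a partial `doc`, regenerating the embedding only when the embedded fields changed. All names here are hypothetical and `embed` stands in for a call to an external embedding model; this is not the actual indexer code:

```python
# Hypothetical sketch of conditional embedding updates. `embed` stands in
# for an external embedding model call (e.g. Vertex); field names assumed.
EMBEDDED_FIELDS = {"title", "description"}

def build_update(doc_id, changed_fields, doc, embed):
    partial = {f: doc[f] for f in changed_fields}
    if EMBEDDED_FIELDS & set(changed_fields):
        # Only pay the embedding cost when the embedded text actually changed.
        partial["embedding"] = embed(doc["title"] + " " + doc["description"])
    # Bulk `update` action line plus partial-document body.
    return {"update": {"_index": "issues", "_id": doc_id}}, {"doc": partial}

fake_embed = lambda text: [0.0, 0.0]
doc = {"title": "t", "description": "d", "state": "closed"}
action, body = build_update(1, ["state"], doc, fake_embed)  # no re-embedding
```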
Proposal
Given the above considerations, I suggest we generate embeddings outside of Elasticsearch (e.g. with Vertex) and store them in a separate index that contains all embeddings. This allows OpenSearch compatibility and makes switching to a different model easy, while not impacting the Global Search feature or requiring an Elastic license.
Because ELSER is easy to implement, we will finish the work to demonstrate the value of storing vectors in Elasticsearch by implementing hybrid search on staging using ELSER behind a feature flag.