GitLab Embeddings Support
Problem to solve
Today, we don't have a canonical way to store vector embeddings, or a standard location for storing them, to support AI features like Code Suggestions, Duo, Docs chat, and semantic search. We also don't have any broadly available infrastructure for creating and retrieving embeddings for these features. As we continue to evolve features that will depend on Retrieval Augmented Generation (RAG) and semantic search, we'll need to provide access to embedding creation and retrieval services, as well as semantic search endpoints.
Proposal
Overview
The Global Search team will build the foundations for a RAG architecture with the following attributes (a minimal Elasticsearch sketch follows this list):
- Elasticsearch as the vector embeddings store
- Use a single embedding model (TBD) for all GitLab application content initially, with the option to support multiple embedding models in the future
- Storage, retrieval, and search APIs for the embeddings of core GitLab app content, like issues, projects, MRs, comments, etc.
- Reranking of search results
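As a rough illustration of the storage and retrieval pieces, here is what a minimal setup could look like with the Elasticsearch Python client. The index name, field names, 768-dimension vectors, and placeholder query vector are all assumptions for the sketch, not decisions:

```python
# Minimal sketch of an Elasticsearch-backed vector store, assuming the
# `elasticsearch` Python client and ES 8.x. Index name, field names, and the
# 768-dimension embedding size are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index that stores the original text alongside its embedding.
es.indices.create(
    index="gitlab-issues-embeddings",
    mappings={
        "properties": {
            "project_id": {"type": "long"},
            "content": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Approximate kNN retrieval: `query_vector` would come from embedding the
# user's query with the same model used at indexing time.
results = es.search(
    index="gitlab-issues-embeddings",
    knn={
        "field": "embedding",
        "query_vector": [0.1] * 768,  # placeholder query embedding
        "k": 10,
        "num_candidates": 100,
    },
)
```

Reranking would then operate over the top-k hits returned by a query like this.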
Elasticsearch supports language model hosting, which simplifies this architecture somewhat. We can deploy models to the ES cluster using ES APIs rather than building out all of our own model-hosting infrastructure. Note that, by default, Elastic currently supports only CPU-based nodes for model hosting. To leverage GPUs, we'd have to host our own models, but GPUs aren't required to generate embeddings.
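If we do host the embedding model in the cluster, embeddings could be generated server-side at ingest time via an ingest pipeline with an inference processor. The model ID, pipeline name, and field names below are illustrative, and a model would first have to be imported (for example with eland) and started:

```python
# Sketch of generating embeddings at ingest time with a model deployed to the
# Elasticsearch ML node(s). The model ID and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ingest.put_pipeline(
    id="gitlab-embeddings-pipeline",
    processors=[
        {
            "inference": {
                "model_id": "sentence-transformers__all-minilm-l6-v2",  # illustrative
                "target_field": "ml",
                "field_map": {"content": "text_field"},
            }
        }
    ],
)

# Documents indexed through the pipeline get their embedding added server-side.
es.index(
    index="gitlab-issues-embeddings",
    pipeline="gitlab-embeddings-pipeline",
    document={"project_id": 42, "content": "Add embeddings support to search"},
)
```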
In the future, teams may choose to use different models for specific products–for example, the Docs team may prefer to use a distinct model to generate docs embeddings, or Global Search may want a different model for semantic code search.
Impact and integration
Integration points for other teams would largely consist of APIs for generating, storing, retrieving, and searching vector embeddings. While most GitLab data sources would fall under Global Search's purview to make searchable, we would still provide APIs and utilities that make self-serve indexing easy for new teams, as we do today.
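As a purely hypothetical illustration of the kind of self-serve surface other teams might integrate with (none of these names or signatures exist today):

```python
# Hypothetical self-serve surface for teams that need embeddings; names and
# signatures are illustrative only and do not reflect an existing GitLab API.
from typing import Protocol, Sequence


class EmbeddingsService(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        """Generate embeddings for a batch of texts with the shared model."""

    def store(self, index: str, doc_id: str, text: str) -> None:
        """Chunk, embed, and index a document into the shared vector store."""

    def search(self, index: str, query: str, k: int = 10) -> list[dict]:
        """Embed the query and return the k most similar stored chunks."""
```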
Initially, it's possible that we may face resource bottlenecks that slow down the implementation, causing some teams to wait for vector embeddings. However, we plan to talk to teams beforehand to get a clear understanding of which features have both the highest business priority and a need for embeddings.
Model selection and evolution
The chosen model should be evaluated as high quality compared to other models. The Massive Text Embedding Benchmark (MTEB) measures the performance of text embedding models on a diverse set of embedding tasks, and many models are tested and ranked for embeddings quality on HuggingFace's MTEB Leaderboard.
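As an example of what a model-quality check could look like, candidate models can be run against MTEB tasks with the mteb Python package; the model and task below are arbitrary examples, and we'd pick tasks closest to our use cases:

```python
# Sketch of benchmarking a candidate embedding model on a single MTEB task.
# Model and task names are examples only.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["SciFact"])  # a small retrieval task
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```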
The model does not need to be open source, but we should consider the costs of third-party providers. Costs can vary widely across self-hosted and third-party options: in the linked article, the cost to embed 1M documents of 44 chunks of 1,000 tokens each varied from $45 to $17,600 across providers, with very high variance in the time it would take to complete the task (3 days to 509 days).
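To make the scale concrete, a back-of-the-envelope estimate for the corpus described above; the per-token price is a placeholder, not a quote from any provider:

```python
# Back-of-the-envelope cost estimate for the corpus described above.
# The price per million tokens is a placeholder, not a real provider rate.
docs = 1_000_000
chunks_per_doc = 44
tokens_per_chunk = 1_000
total_tokens = docs * chunks_per_doc * tokens_per_chunk  # 44 billion tokens

price_per_million_tokens = 0.10  # USD, illustrative
estimated_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ${estimated_cost:,.0f}")
# 44,000,000,000 tokens -> $4,400 at this illustrative rate
```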
In light of these cost and time considerations, we'll want to think hard about how much data we embed and how fresh it needs to be. We'll likely need to experiment with algorithms for determining the initial importance of data based on factors such as update frequency, creation date, etc.
There will need to be an evaluation pipeline as well, which we would expect to collaborate on with product teams. For example, we could rank the quality of the Duo responses on a scale of 1-5. From there, we can run experiments using various models to determine their capability for a given use case.
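One very rough shape such a pipeline could take, assuming 1-5 scores collected per model (all names and numbers below are illustrative):

```python
# Illustrative shape of an evaluation run: each candidate model answers the
# same question set, responses are scored 1-5, and mean scores are compared.
from statistics import mean

# In practice, scores would come from human raters or an automated rubric.
scored_runs = {
    "model-a": [4, 5, 3, 4, 4],
    "model-b": [3, 3, 4, 2, 3],
}

for model_name, scores in scored_runs.items():
    print(f"{model_name}: mean quality {mean(scores):.2f} over {len(scores)} questions")
```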
Performance and scalability
Scaling will need to be considered seriously at the outset of this project, rather than treated as something we'll grow into. For RAG and LLM applications to be most useful, they need to operate at appropriately large scale immediately.
Using GPU compute would significantly speed up embedding generation, which may be desirable at least for the initial embeddings ingestion. Depending on the volume and frequency of updates, CPUs could be sufficient for keeping embeddings up to date afterwards.
GPU inference is much faster but only becomes cost-effective above a certain number of inferences per hour. The break-even point found in one experiment was around 520 inferences per hour, below which CPU is more cost-effective. It is unlikely that we'd average fewer than 520 inferences per hour on SaaS, but it's possible that many customers would require fewer than that after the initial embeddings creation.
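A simple way to reason about that break-even point, using placeholder rates rather than real benchmark numbers:

```python
# Illustrative break-even calculation between CPU and GPU inference.
# Hourly and per-inference costs are placeholders, not measurements; the
# ~520 inferences/hour figure in the text came from an external experiment.
gpu_cost_per_hour = 1.20          # USD per hour of GPU time, placeholder
cpu_cost_per_inference = 0.0023   # USD per inference on CPU, placeholder

# With a fixed hourly GPU cost, the GPU wins once the equivalent CPU spend
# for the same volume exceeds the GPU's hourly price.
break_even_inferences_per_hour = gpu_cost_per_hour / cpu_cost_per_inference
print(f"GPU becomes cheaper above ~{break_even_inferences_per_hour:.0f} inferences/hour")
# -> GPU becomes cheaper above ~522 inferences/hour with these placeholder rates
```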
For SaaS, there may be additional vendors we want to consider for hosting our AI workloads, such as Anyscale.
Security and privacy considerations
In general, if customers are comfortable with items being indexed into search, they should be comfortable with them being embedded, but there may be edge cases that need to be considered.
In particular, we need to consider how permissions in search would impact permissions for embeddings. In theory, because the embeddings would be indexed alongside the text fields in Elasticsearch, we might be able to leverage our existing permissions model, but that needs to be confirmed. We don't want customers seeing search results they shouldn't. Where we're not sure what to do, the default should be to exclude the data from embeddings.
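If that holds, one option to validate would be applying the same kind of filter we use for text search alongside the kNN query; the field names and IDs below are illustrative:

```python
# Sketch of restricting a kNN query to projects the user can access by reusing
# a filter alongside the vector search. Field names are illustrative; whether
# this matches our existing permissions filtering still needs confirmation.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

allowed_project_ids = [101, 202, 303]  # resolved from the user's memberships

results = es.search(
    index="gitlab-issues-embeddings",
    knn={
        "field": "embedding",
        "query_vector": [0.1] * 768,  # placeholder query embedding
        "k": 10,
        "num_candidates": 100,
        "filter": {"terms": {"project_id": allowed_project_ids}},
    },
)
```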
WIP - Iteration plan
This section will evolve as we hear more from R&D, so consider the below an initial proposal:
- Stand up a "basic" RAG architecture that includes text chunking and embeddings generation pipelines, indexing, and a vector search endpoint utilizing issues data. We should be able to get relevant answers to questions like "Are there any issues for embeddings support?". This is a large first iteration, but I don't see any obvious way to shrink it and still reach the proposed outcome. A minimal chunking sketch appears at the end of this section.
- Collaborate with the AI team to connect the endpoint to an LLM-based feature, such as chat, so users can ask questions about issues and the product can receive context from the vector search endpoint. During this iteration, the development focus would be on building an evaluation and experimentation framework to test the quality of responses. Ultimately, the team that owns the LLM feature will decide what to do with it, and we'll collaborate to improve the final output.
- Depending on previous success, for iterations 4+n, rinse and repeat steps 1, 2, and 3 for epics, MRs, code[1], comments, projects, and wikis.
[1] Note: Code would likely require a different embedding model and a more complex chunking strategy. It would also likely require a lot more testing and tuning of results and is unlikely to be a low-effort iteration to get something useful.
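As referenced in the first iteration above, here is a minimal sketch of the kind of fixed-size chunking with overlap that the chunking pipeline could start from. Splitting on whitespace and the chunk sizes are placeholders; real chunking would use the embedding model's tokenizer and likely respect markdown and code structure:

```python
# Naive fixed-size chunking with overlap, as a starting point for iteration 1.
# Whitespace splitting is a placeholder for real tokenizer-based chunking.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks
```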