Spike: Investigate and Validate Path for Privacy-Oriented Embedding Models
Summary - Why is this Spike needed?
As GitLab moves forward on offering customized features to its customers, leveraging semantic embeddings for RAG and semantic search will be necessary. The current predominant model for AI-enabled features is the use of 3rd party external models (Anthropic and Vertex AI). An alternate path is being paved for the use of self-hosted open-source (OS) models in self-managed and air-gapped environments. Important questions remain regarding how GitLab will enable customization for users on other GL surfaces, such as GitLab.com and Dedicated. Embedding poses a particular privacy concern, as it requires a customer to send an entire repository or documentation set to an embedding model to be indexed. While some customers may not have issues sending their documentation to our external partners, for others this is a non-starter. We need to determine the optimal, scalable path for GL customers to leverage RAG and other embedding-based approaches in their customizations, while still maintaining data privacy.
Note
There are several potential permutations of a RAG approach for feature customizations, as outlined below. While it makes sense to enable entirely 3rd party or entirely self-hosted options, we would preferably support just one hybrid approach.
- entirely 3rd party supported
  - For those customers who are not concerned with sharing repository information with a third party such as Google, we have a lot of flexibility. We can continue to use the `textembedding-gecko` model provided by Vertex AI as an embedding solution. For vector storage and retrieval we can use PostgreSQL with PGVector (as is currently used for GitLab documentation embeddings) or any other vector storage and retrieval option, such as Elasticsearch or Vertex AI Search. The enriched prompt can continue to be sent to our third-party LLMs via API. A minimal sketch of this flow follows this option.
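As a rough illustration of the fully 3rd party option, the sketch below embeds text with the Vertex AI `textembedding-gecko` model and stores/queries the vectors in PostgreSQL with PGVector. The table name, connection string, and model version are illustrative assumptions, not GitLab's actual schema, and the snippet assumes `vertexai.init(...)` / application default credentials are already configured.

```python
# Minimal sketch: Vertex AI embeddings + PostgreSQL/pgvector storage and retrieval.
# Table name, connection details, and model version are illustrative only.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
conn = psycopg2.connect("dbname=embeddings_demo")  # hypothetical database

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg2 send/read pgvector values as numpy arrays

def embed(texts: list[str]) -> list[np.ndarray]:
    # textembedding-gecko returns 768-dimensional vectors.
    return [np.array(e.values) for e in model.get_embeddings(texts)]

def index_documents(docs: list[str]) -> None:
    with conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS doc_embeddings "
            "(id bigserial PRIMARY KEY, content text, embedding vector(768))"
        )
        for doc, vec in zip(docs, embed(docs)):
            cur.execute(
                "INSERT INTO doc_embeddings (content, embedding) VALUES (%s, %s)",
                (doc, vec),
            )

def retrieve(query: str, k: int = 5) -> list[str]:
    # Nearest neighbours by cosine distance (pgvector's <=> operator).
    [query_vec] = embed([query])
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM doc_embeddings ORDER BY embedding <=> %s LIMIT %s",
            (query_vec, k),
        )
        return [row[0] for row in cur.fetchall()]
```

This is essentially the same pattern already used for GitLab documentation embeddings; the retrieved chunks would then be injected into the prompt sent to the 3rd party LLM.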
- hybrid approach - self-hosted embedding model + 3rd party LLM
  - GitLab uses the initial Custom Models approach of allowing configuration of a self-hosted embedding model. The customer hosts the embedding model, and we allow configuration of it via a self-hosted version of the AI Gateway. Users can host their own indexes, with retrieved context routed via the AI Gateway for prompt injection. We then continue to leverage 3rd party LLMs. This allows the bulk of customer information to remain localized, with only the data injected into the prompt being sent to an external party.
  - customers would need to host and maintain their own embedding models
  - this would shift the GPU requirements and compute costs to customer infrastructure
  - customers have to run a model that is compatible with our features and vector store; for example, if our database schema requires 768-dimensional vectors, their embedding model also needs to produce vectors of that dimension (see the compatibility-check sketch after this option)
  - this works in air-gapped environments
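The dimension-compatibility constraint above could be enforced with a simple probe of the customer's endpoint. The sketch below assumes an OpenAI-compatible `/v1/embeddings` route (as exposed by several self-hosted serving stacks); the endpoint URL, model name, and expected dimension are illustrative assumptions, not a defined GitLab interface.

```python
# Sketch of a compatibility check for a customer-hosted embedding model:
# verify that returned vectors match the vector(768) column in our schema.
# The OpenAI-compatible request/response shape is an assumption.
import requests

EXPECTED_DIMENSION = 768  # must match the vector(N) column used by the feature

def fetch_embedding(endpoint: str, model: str, text: str) -> list[float]:
    resp = requests.post(
        f"{endpoint}/v1/embeddings",
        json={"model": model, "input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def validate_embedding_endpoint(endpoint: str, model: str) -> None:
    vector = fetch_embedding(endpoint, model, "compatibility probe")
    if len(vector) != EXPECTED_DIMENSION:
        raise ValueError(
            f"Embedding model '{model}' returns {len(vector)}-dimensional vectors; "
            f"this feature requires {EXPECTED_DIMENSION}."
        )

# Example against a hypothetical customer endpoint and model:
# validate_embedding_endpoint("http://embeddings.internal:8080", "nomic-embed-text-v1.5")
```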
- hybrid approach - GL-hosted embedding model + 3rd party LLM
  - GitLab hosts an embedding model on the AI Gateway. Users keep their indexes within their instances, and retrieval likewise occurs via the AI Gateway. We then continue to leverage 3rd party LLMs. This allows the bulk of customer information to remain localized, with only the data injected into the prompt being sent to an external party. A retrieval-and-prompt-injection sketch follows this option.
  - customers don't need to maintain their own model
  - the data sent for embedding won't reach a 3rd party model provider
  - GL needs to evaluate and understand the limitations of any OS model we host
  - Runway would need to add support for GPU-based infrastructure; this could not run on Cloud Run (the current CPU-based instance), as it would lead to scalability issues
  - running costs would be higher and would require consideration of the pricing strategy
  - this would not work for air-gapped solutions
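To make the data boundaries concrete, the sketch below walks through the request flow for this hybrid option: query embedding via the GL-hosted gateway, nearest-neighbour retrieval against the index kept inside the customer instance, and prompt injection before the call to the 3rd party LLM. The AI Gateway routes, payload shapes, and `search_local_index` helper are hypothetical placeholders, not the real AI Gateway API.

```python
# Sketch of the "GL-hosted embedding model + 3rd party LLM" flow, highlighting
# which data leaves the customer instance. Routes and payloads are hypothetical.
import requests

AI_GATEWAY = "https://cloud.gitlab.com/ai"  # GL-hosted gateway (illustrative)

def embed_via_gateway(text: str, token: str) -> list[float]:
    # Only the text to be embedded reaches GitLab-hosted infrastructure;
    # it is not forwarded to a 3rd party model provider.
    resp = requests.post(
        f"{AI_GATEWAY}/v1/embeddings",  # hypothetical route
        headers={"Authorization": f"Bearer {token}"},
        json={"input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def answer_with_rag(question: str, token: str, search_local_index) -> str:
    # 1. Embed the query via the GL-hosted gateway.
    query_vector = embed_via_gateway(question, token)
    # 2. Retrieve context from the index held inside the customer instance
    #    (e.g. a pgvector nearest-neighbour query); this data stays local.
    context_chunks = search_local_index(query_vector, limit=5)
    # 3. Only the question plus the few retrieved chunks are injected into the
    #    prompt that is forwarded to the 3rd party LLM.
    prompt = (
        "Answer using only this context:\n"
        + "\n".join(context_chunks)
        + f"\n\nQuestion: {question}"
    )
    resp = requests.post(
        f"{AI_GATEWAY}/v1/chat",  # hypothetical route proxying to the 3rd party LLM
        headers={"Authorization": f"Bearer {token}"},
        json={"prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["completion"]
```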
- entirely self-hosted
  - For the most data-sensitive customers (entirely air-gapped), we will want to provide a RAG option that is completely offline and self-hosted to augment air-gapped, self-hosted feature development. This would allow the embedding, storage, and retrieval mechanisms to all remain local. Any architecture developed for Code Suggestions RAG would also be applicable to Chat RAG. The enriched prompt would be sent to a self-hosted LLM, for total containment. A fully local sketch follows this option.
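The sketch below illustrates the fully contained variant: a locally run embedding model, an in-memory stand-in for the vector index, and a self-hosted LLM behind an OpenAI-compatible completions endpoint. The embedding model, LLM URL, and index are illustrative assumptions for whatever the customer actually deploys; nothing in this flow leaves the air-gapped environment.

```python
# Sketch of the entirely self-hosted option: embedding, storage, retrieval,
# and generation all stay local. Model names and the LLM endpoint are
# hypothetical stand-ins.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

# In an air-gapped deployment the model files would be pre-downloaded locally.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
LOCAL_LLM = "http://llm.internal:8000/v1/completions"  # hypothetical self-hosted LLM

documents: list[str] = []
vectors: list[np.ndarray] = []

def index_documents(docs: list[str]) -> None:
    documents.extend(docs)
    vectors.extend(embedder.encode(docs, normalize_embeddings=True))

def retrieve(question: str, k: int = 5) -> list[str]:
    query = embedder.encode([question], normalize_embeddings=True)[0]
    scores = np.array([float(v @ query) for v in vectors])  # cosine similarity
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    resp = requests.post(
        LOCAL_LLM, json={"prompt": prompt, "max_tokens": 512}, timeout=120
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```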
- entirely GL hosted - GL-hosted embedding model + GL-hosted LLM
  - Non-air-gapped customers can connect to the GL-hosted AIGW to generate embeddings, which are stored on their instances, and make prompt calls to an LLM hosted by GL via the GL AIGW.
  - customers don't need to maintain their own model
  - the data sent for embedding won't reach a 3rd party model provider
  - GL needs to evaluate and understand the limitations of any OS model we host
  - Runway would need to add support for GPU-based infrastructure; this could not run on Cloud Run (the current CPU-based instance), as it would lead to scalability issues
  - running costs would be higher and would require consideration of the pricing strategy
Timebox Expectations
Per our handbook spike guidelines, this should ideally be one week's effort; if the spike is more complex and might take longer than a week to investigate, an update should be provided.
Expected Outcomes
- a well-considered path forward is determined for privacy-enabled semantic embedding
- technical proposal added to #456224