Handling Chat embeddings for SM/Dedicated

Problem

Duo Chat relies on embeddings for parts of its functionality (currently: GitLab Docs). For SaaS, this process works as follows:

  1. Documentation markdown files are chunked and stored in a Postgres database. This happens every day in a Sidekiq job, which accounts for the fact that for SaaS, docs are constantly in flux.
  2. When a user asks a question, it is sent to the AI model to be turned into an embedding
  3. The embeddings storage is searched via a vector proximity search against the question embedding, and the associated text is returned
  4. The content retrieved this way is sent as context along with the prompt to the model to produce the final answer

This can be visualized with this sequence diagram:

sequenceDiagram
    autonumber
    participant U as User
    participant GL as GitLab
    participant DB as Embeddings Database
    participant M as AI model
    
    Note over GL,M: Cloud-managed

    loop Sidekiq: Update docs embeddings
        GL->>GL: Parse and chunk docs markdown
        GL->>M: Fetch embeddings
        M-->>GL: Docs embeddings
        GL->>DB: Store embeddings
    end
 
    U->>GL: Send question
    GL->>M: Encode question
    M-->>GL: Question embedding
    GL->>DB: Proximity search with question embedding
    DB-->>GL: N closest text matches
    GL->>M: Send prompt with text matches
    M-->>GL: Final answer
    GL-->>U: Final answer
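
As a concrete illustration of steps 2 and 3 of this flow, the sketch below shows what the proximity search against a pgvector-backed table could look like. The table and column names (doc_embeddings, embedding, content) and the embed_question helper are hypothetical stand-ins, not the actual schema or client code:

import psycopg2

def embed_question(question: str) -> list[float]:
    # Stand-in for the call to the AI model (OpenAI or Vertex AI) that turns
    # the question text into an embedding vector.
    raise NotImplementedError

def find_closest_doc_chunks(question: str, limit: int = 5) -> list[str]:
    # Step 2: encode the question as an embedding.
    q_emb = embed_question(question)
    # pgvector expects vectors in '[x1,x2,...]' literal form.
    q_literal = "[" + ",".join(str(x) for x in q_emb) + "]"

    conn = psycopg2.connect("dbname=embeddings")
    with conn, conn.cursor() as cur:
        # Step 3: vector proximity search; '<=>' is pgvector's cosine distance.
        cur.execute(
            "SELECT content FROM doc_embeddings "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (q_literal, limit),
        )
        return [row[0] for row in cur.fetchall()]

# The returned chunks are then sent as context alongside the prompt (step 4).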

In order to bring this to self-managed/Dedicated, we need to address the following questions:

  1. Where are embeddings stored? We could use storage that is local to the GitLab instance, or host this database on behalf of customers.
  2. How are embeddings stored? We currently use a dedicated Postgres database to store embeddings. To support vector-based search queries, it requires the pgvector extension to be installed, which is not part of our default PG setup for SM.
    • We decided to use the same storage combo for SM. We verified that this extension is compliant with GL licensing, that we can ship it to customers, and that it is supported by all cloud providers we support, as per this comment. We furthermore discarded the option of using Cloud SQL as per this comment. A minimal setup sketch follows after this list.
  3. How are embeddings populated? Regardless of whether we use local or remote storage, an open question is how this database would be populated. This step requires parsing GitLab docs markdown and making calls into the AI model to turn them into embeddings. This work is subject to AI vendor quotas.
  4. AI model support: We currently support OpenAI embeddings with VertexAI support being added but still experimental. An open question was whether we need to support both for SM too.
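
To make the pgvector requirement in point 2 concrete, the sketch below shows roughly what the one-time setup of such a database could look like. The table name, columns, and the 1536-dimension vector size are assumptions for illustration, not the actual schema:

import psycopg2

SETUP_SQL = """
-- Requires the pgvector extension to be available on the PG server.
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical schema for chunked documentation embeddings.
CREATE TABLE IF NOT EXISTS doc_embeddings (
    id        bigserial PRIMARY KEY,
    url       text NOT NULL,          -- source docs page
    content   text NOT NULL,          -- chunked markdown text
    embedding vector(1536) NOT NULL   -- dimension depends on the embedding model
);
"""

def setup_embeddings_db(dsn: str = "dbname=embeddings") -> None:
    # One-time setup; assumes the connecting role may create extensions.
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute(SETUP_SQL)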

Solution exploration

Descoped: dynamic and private data

Generally, we can think of data broadly in the following dimensions:

Nature  | Availability
static  | public
dynamic | private

Embeddings won't just be used for GitLab docs, which are static/public in nature. They will also be needed for MRs, issues, source code, etc., i.e. data that is different for each customer and may be private in nature (i.e. not allowed to leave their instances).

Related discussion in this thread.

Constraints

We need to consider the following constraints for working out a solution:

  • Storage size. If we decide to ship some sort of embeddings artifact to customers (a pre-seeded DB or an intermediate format used to import embeddings), or directly produce embeddings in the customer instance, we are constrained by database growth as mentioned here (see the rough sizing sketch after this list).
  • AI model quotas. Any solution will require us to call into the AI model to retrieve embeddings. For the existing SaaS solution we are currently constrained to 600 RPM as mentioned here. Our goal should be to not materially add to this request volume, or ask to get this raised.
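
As a rough illustration of the storage size constraint above, the back-of-envelope calculation below shows how the database grows with the number of chunks and the embedding dimension. Every number in it is an assumption for illustration, not a measurement:

# Back-of-envelope sizing; all inputs are assumed, not measured.
num_chunks = 20_000           # assumed number of documentation chunks
embedding_dim = 1536          # assumed embedding dimension (e.g. OpenAI ada-002)
bytes_per_float = 4           # pgvector stores single-precision floats
avg_chunk_text_bytes = 2_000  # assumed average size of the chunk text itself

vector_bytes = num_chunks * embedding_dim * bytes_per_float
text_bytes = num_chunks * avg_chunk_text_bytes
total_mb = (vector_bytes + text_bytes) / 1024 / 1024

print(f"~{total_mb:.0f} MB before indexes and per-row overhead")
# With these assumed inputs this comes out to roughly 155 MB.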

Solution dimensions

This is a multi-dimensional problem since there are various ways to handle each aspect outlined in the problem statement. This section briefly summarizes these dimensions.

Local vs remote storage

Embeddings storage could live either on premises or be hosted by us. This gives rise to the following options:

  1. Embeddings are shipped to customer
    1. Create embeddings build artifact during release, ship it to customers. We could either pre-seed a database from the documentation text as it was current at the time of release, or create some other intermediate representation (e.g. JSON dump) that we then bundle and ship with a milestone release. Ideally this only happens once, at the time we promote a release, since it requires talking to the AI model to obtain embeddings. We identified challenges with this since our release pipeline produces immutable packages, so by the time this step runs we can no longer include anything else in the package.
    2. Create embeddings build artifact during release, make it available for import. Alternatively, we could create this artifact, but instead of bundling it with the release, make it available for download somewhere so that SM instances can import it, either during the upgrade process or in response to an application trigger (e.g. enabling the Chat feature). A minimal export/import sketch follows after this list.
  2. Embeddings are hosted by us
    1. Serve embeddings from SaaS database. We already import embeddings on a nightly basis for SaaS. We could make this data available to SM too. The main problems to solve here are that we would have to start versioning this data, since SM instances need a fixed point-in-time view of it, and that we would have to make it available through an API so the application can request it.
    2. Serve embeddings from dedicated database. Alternatively, we could host a dedicated Postgres embeddings database for Cloud Connector customers. This database could be populated with similar mechanisms as outlined under Embeddings are shipped to customer, or be produced as a snapshot/dump from the SaaS DB at the time of release.
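
For options 1.1/1.2 above, one possible intermediate representation is a JSONL dump that is produced once per release and that SM instances download and import into their local pgvector table. The file layout, table schema, and function names below are hypothetical:

import json
import psycopg2

def export_embeddings_artifact(rows, path: str = "docs-embeddings.jsonl") -> None:
    # Produced once at release time: one line per doc chunk with its embedding.
    with open(path, "w") as f:
        for url, content, embedding in rows:
            f.write(json.dumps({"url": url, "content": content,
                                "embedding": embedding}) + "\n")

def import_embeddings_artifact(path: str, dsn: str = "dbname=embeddings") -> None:
    # Run on the SM instance, e.g. during upgrade or when Chat is enabled.
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                vector_literal = "[" + ",".join(str(x) for x in rec["embedding"]) + "]"
                cur.execute(
                    "INSERT INTO doc_embeddings (url, content, embedding) "
                    "VALUES (%s, %s, %s::vector)",
                    (rec["url"], rec["content"], vector_literal),
                )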

Examples: Prospective solutions (not complete)

Note: this is not a one-dimensional problem, so there are valid permutations of some of the solutions outlined here; they are not all listed in detail.

Approach 1: Pre-seeded database at customer site + AI gateway embeddings API

In this approach we would merely push 3P model access down into the AI gateway but retain the overall "protocol" between the GitLab application and the model, i.e. the main logic remains in gitlab-rails and the AI gateway acts as a simple proxy. This necessitates that documentation embeddings are made available on premises, since the question vector is an input to the text search. It is still unclear how that would work, e.g. by making them available as a download:

sequenceDiagram
    autonumber
    participant U as User
    participant GL as GitLab
    participant DB as Embeddings Database
    participant AI as AI gateway
    participant M as AI model
    
    Note over U,DB: Self-managed
    Note over AI,M: Cloud-managed

    U->>GL: Send question
    GL->>AI: Encode question
    AI->>M: Request question embedding
    M-->>AI: Question embedding
    AI-->>GL: Question embedding
    GL->>DB: Proximity search with question embedding
    DB-->>GL: N closest text matches
    GL->>AI: Send prompt with text matches
    AI->>M: Send prompt with text matches
    M-->>AI: Final answer
    AI-->>GL: Final answer
    GL-->>U: Final answer
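
A minimal sketch of the "simple proxy" embeddings endpoint this approach would add to the AI gateway, assuming a FastAPI-style service; the route, payload shape, and the model call are assumptions rather than the actual AI gateway API:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EmbedRequest(BaseModel):
    question: str

class EmbedResponse(BaseModel):
    embedding: list[float]

def request_embedding_from_model(text: str) -> list[float]:
    # Stand-in for the vendor call (OpenAI or Vertex AI embeddings API).
    raise NotImplementedError

@app.post("/v1/embeddings", response_model=EmbedResponse)
def encode_question(payload: EmbedRequest) -> EmbedResponse:
    # The gateway stays stateless: it only forwards the question text to the
    # model and returns the embedding. The proximity search then happens in the
    # self-managed instance against its local embeddings database.
    return EmbedResponse(embedding=request_embedding_from_model(payload.question))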

Approach 2: GitLab-managed database + rich AI gateway chat API

In this approach, we push more functionality into the GitLab infrastructure by simplifying the protocol between GitLab and the AI gateway. Here, the GitLab application only sends the original question to the AI gateway; the AI gateway then executes the internal protocol, including querying an embeddings database. It is still TBD how this database would be maintained for self-managed, since it would require timestamped/snapshotted documentation by version. This would also change the AI gateway from a stateless service to a stateful one, because it now talks to connected storage:

sequenceDiagram
    autonumber
    participant U as User
    participant GL as GitLab
    participant AI as AI gateway
    participant DB as Embeddings Database
    participant M as AI model
    
    Note over U,GL: Self-managed
    Note over AI,M: Cloud-managed

    U->>GL: Send question
    GL->>AI: Send question
    AI->>M: Request question embedding
    M-->>AI: Question embedding
    AI->>DB: Proximity search with question embedding
    DB-->>AI: N closest text matches
    AI->>M: Send prompt with text matches
    M-->>AI: Final answer
    AI-->>GL: Final answer
    GL-->>U: Final answer
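
Under the same assumptions as before (hypothetical schema and stand-in model calls), a rich chat endpoint for this approach could look like the sketch below. Note how the gateway now needs its own database connection, which is what makes it stateful:

from fastapi import FastAPI
from pydantic import BaseModel
import psycopg2

app = FastAPI()

class ChatRequest(BaseModel):
    question: str

class ChatResponse(BaseModel):
    answer: str

def request_embedding_from_model(text: str) -> list[float]:
    raise NotImplementedError  # stand-in for the vendor embeddings call

def request_completion_from_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the vendor completion call

@app.post("/v1/chat/docs", response_model=ChatResponse)
def answer_question(payload: ChatRequest) -> ChatResponse:
    # 1. Encode the question as an embedding.
    q_emb = request_embedding_from_model(payload.question)
    q_literal = "[" + ",".join(str(x) for x in q_emb) + "]"

    # 2. Proximity search against the gateway-managed embeddings database.
    conn = psycopg2.connect("dbname=embeddings")  # connected storage -> stateful
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM doc_embeddings "
            "ORDER BY embedding <=> %s::vector LIMIT 5",
            (q_literal,),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())

    # 3. Send the prompt with the text matches and return the final answer.
    prompt = (f"Answer using the following documentation:\n{context}\n\n"
              f"Question: {payload.question}")
    return ChatResponse(answer=request_completion_from_model(prompt))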