# Handling Chat embeddings for SM/Dedicated

## Problem
Duo Chat currently relies on embeddings for parts of its functionality (currently: GitLab Docs). The way this process works for SaaS is as follows:
- Documentation markdown files are chunked and stored in a Postgres database. This happens every day in a Sidekiq job, which accounts for the fact that for SaaS, docs are constantly in flux.
- When a user asks a question, it is sent to the AI model to be turned into an embedding
- The embeddings storage is searched for matching embeddings using the question embedding and a vector proximity search, and the associated text is returned
- The content retrieved this way is sent as context along with the prompt to the model to produce the final answer
This can be visualized with this sequence diagram:
```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant GL as GitLab
    participant DB as Embeddings Database
    participant M as AI model
    Note over GL,M: Cloud-managed
    loop Sidekiq: Update docs embeddings
        GL->>GL: Parse and chunk docs markdown
        GL->>M: Fetch embeddings
        M-->>GL: Docs embeddings
        GL->>DB: Store embeddings
    end
    U->>GL: Send question
    GL->>M: Encode question
    M-->>GL: Question embedding
    GL->>DB: Proximity search with question embedding
    DB-->>GL: N closest text matches
    GL->>M: Send prompt with text matches
    M-->>GL: Final answer
    GL-->>U: Final answer
```
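The nightly update loop (steps 1-4 in the diagram) can be sketched as follows. This is an illustrative Python sketch, not the actual Sidekiq job: `chunk_markdown`, `embed`, and the in-memory row list are stand-ins for the real chunker, the AI model call, and the Postgres table.

```python
# Illustrative sketch of the nightly docs embeddings refresh.
# `embed` stands in for the AI model call; the real job writes
# to Postgres instead of returning an in-memory list.

def chunk_markdown(text, max_chars=500):
    """Split a docs page into roughly paragraph-sized chunks."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def embed(text):
    """Stub for the AI model embeddings call (returns a fake vector)."""
    return [float(len(text)), float(text.count(" "))]

def update_embeddings(pages):
    """Re-chunk all docs pages and produce (chunk, embedding) rows."""
    rows = []
    for page in pages:
        for chunk in chunk_markdown(page):
            rows.append({"content": chunk, "embedding": embed(chunk)})
    return rows

rows = update_embeddings(["Intro paragraph.\n\nMore details here."])
```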
In order to bring this to self-managed/Dedicated, we need to address the following questions:
- Where are embeddings stored? We could use storage that is local to the GitLab instance, or host this database on behalf of customers.
- How are embeddings stored? We currently use a dedicated Postgres database to store embeddings. In order to support vector-based search queries, it requires the `pgvector` extension to be installed, which is not the case for our default PG setup for SM.
  - We decided to use the same storage combo for SM. We verified that this extension is compliant with GL licensing, that we can ship it to customers, and that it is supported by all cloud providers we support, as per this comment. We furthermore discarded the option of using Cloud SQL, as per this comment.
- How are embeddings populated? Regardless of whether we use local or remote storage, an open question is how this database would be populated. This step requires parsing GitLab docs markdown and making calls into the AI model to turn them into embeddings. This work is subject to AI vendor quotas.
- AI model support: We currently support OpenAI embeddings, with Vertex AI support being added but still experimental. An open question was whether we need to support both for SM too.
  - We currently expect to complete the work to move away from OpenAI embeddings before finishing the work described here. This work is handled by the AI Framework group.
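Since the chosen storage combo relies on `pgvector`, the proximity search boils down to a single SQL query. A minimal sketch, assuming a hypothetical `doc_embeddings` table and pgvector's cosine-distance operator `<=>`; the real schema and query layer differ:

```python
# Sketch of the vector proximity search that pgvector enables.
# Table and parameter names are hypothetical.

def proximity_search_sql(top_n=5):
    """Build a nearest-neighbour query using pgvector's cosine
    distance operator (`<=>`). The question embedding is bound
    as a query parameter at execution time."""
    return (
        "SELECT content, embedding <=> %(question)s AS distance "
        "FROM doc_embeddings "
        "ORDER BY distance "
        f"LIMIT {top_n}"
    )

sql = proximity_search_sql(top_n=3)
```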
## Solution exploration

### Descoped: dynamic and private data
Generally, we can think of data broadly in the following dimensions:
| Nature  | Availability |
|---------|--------------|
| static  | public       |
| dynamic | private      |
Embeddings won't just be used for GitLab docs, which are static and public in nature. They will also be needed for MRs, issues, source code etc., i.e. data that is different for each customer and may be private in nature, meaning it is not allowed to leave their instances.
Related discussion in this thread.
### Constraints
We need to consider the following constraints for working out a solution:
- Storage size. If we decide to ship some sort of embeddings artifact to customers (a pre-seeded DB or an intermediate format used to import embeddings), or directly produce embeddings in the customer instance, we are constrained by database growth as mentioned here.
- AI model quotas. Any solution will require us to call into the AI model to retrieve embeddings. For the existing SaaS solution we are currently constrained to 600 RPM as mentioned here. Our goal should be to not materially add to this request volume, or ask to get this raised.
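To illustrate the quota constraint: any population mechanism that calls the embeddings model has to stay under the 600 RPM ceiling. A minimal client-side sliding-window throttle, sketched without any real API client:

```python
# Minimal client-side throttle for a requests-per-minute quota,
# such as the 600 RPM embeddings limit mentioned above.
import time
from collections import deque

class RateLimiter:
    def __init__(self, max_requests=600, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    def acquire(self, now=None):
        """Return True if a request may be sent now, else False."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that fell out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False

# Tiny window for demonstration: 2 requests per 60 seconds.
limiter = RateLimiter(max_requests=2, window_seconds=60.0)
allowed = [limiter.acquire(now=t) for t in (0.0, 1.0, 2.0, 61.0)]
```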
### Solution dimensions
This is a multi-dimensional problem since there are various ways to handle each aspect outlined in the problem statement. This section briefly summarizes these dimensions.
#### Local vs remote storage
Embeddings storage could live either on premises or be hosted by us. This gives rise to the following options:
- **Embeddings are shipped to customer**
  - Create embeddings build artifact during release, ship it to customers. We could either pre-seed a database from the documentation text as it was current at the time of release, or create some other intermediate representation (e.g. a JSON dump) that we then bundle and ship with a milestone release. Ideally this only happens once, at the time we promote a release, since it requires talking to the AI model to obtain embeddings. We identified challenges with this since our release pipeline produces immutable packages, so at the time this happens we cannot include anything else in the package anymore.
  - Create embeddings build artifact during release, make it available for import. Alternatively, we could create this artifact, but instead of bundling it with the release, make it available for download somewhere so that SM instances can import it, either during the upgrade process or in response to an application trigger (e.g. enabling the Chat feature).
- **Embeddings are hosted by us**
  - Serve embeddings from SaaS database. We already import embeddings on a nightly basis for SaaS. We could make this data available to SM too. The main problems to solve here would be that we would have to start versioning this data, since SM instances need a specific fixed-time view on it, and that we would have to make it available through an API so the application can request it.
  - Serve embeddings from dedicated database. Alternatively, we could host a dedicated Postgres embeddings database for Cloud Connector customers. This database could be populated with similar mechanisms as outlined under "Embeddings are shipped to customer", or be produced as a snapshot/dump from the SaaS DB at the time of release.
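The "intermediate representation" option could be as simple as a JSON Lines dump produced at release time and re-imported on the instance. A sketch with hypothetical field names:

```python
# Sketch of an embeddings dump/import roundtrip using JSON Lines.
# Field names ("content", "embedding", "version") are hypothetical.
import io
import json

def export_embeddings(rows, fileobj):
    """Write one JSON object per line, suitable for a release artifact."""
    for row in rows:
        fileobj.write(json.dumps(row) + "\n")

def import_embeddings(fileobj):
    """Read the artifact back into rows for insertion into pgvector."""
    return [json.loads(line) for line in fileobj if line.strip()]

# Roundtrip demonstration using an in-memory buffer instead of a file.
buf = io.StringIO()
export_embeddings(
    [{"content": "chunk text", "embedding": [0.1, 0.2], "version": "16.7"}], buf
)
buf.seek(0)
rows = import_embeddings(buf)
```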
### Examples: Prospective solutions (not complete)

Note: this is not a one-dimensional problem, so there are valid permutations of some of the solutions outlined here; they are not all listed in detail.
#### Approach 1: Pre-seeded database at customer site + AI gateway embeddings API

In this approach, we would merely push down third-party model access into the AI gateway but retain the overall "protocol" between the GitLab application and the model, i.e. the main logic remains in gitlab-rails and the AI gateway acts as a simple proxy. This necessitates that documentation embeddings are made available on premises, since the question vector is an input to the text search. It is still unclear how that would work, e.g. by making the embeddings available as a download:
```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant GL as GitLab
    participant DB as Embeddings Database
    participant AI as AI gateway
    participant M as AI model
    Note over U,DB: Self-managed
    Note over AI,M: Cloud-managed
    U->>GL: Send question
    GL->>AI: Encode question
    AI->>M: Request question embedding
    M-->>AI: Question embedding
    AI-->>GL: Question embedding
    GL->>DB: Proximity search with question embedding
    DB-->>GL: N closest text matches
    GL->>AI: Send prompt with text matches
    AI->>M: Send prompt with text matches
    M-->>AI: Final answer
    AI-->>GL: Final answer
    GL-->>U: Final answer
```
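Approach 1 seen from the GitLab application: the gateway stays a thin proxy and retrieval happens against the on-premises database. A sketch with all remote calls stubbed out; none of these function names exist in the actual codebase:

```python
# Approach 1: retrieval logic stays in the instance, the AI gateway
# is a thin proxy. `encode`, `local_search` and `complete` are stubs
# standing in for the AI gateway client and the on-prem pgvector query.

def answer_question(question, encode, local_search, complete, top_n=5):
    """Orchestrate the Chat flow: encode remotely, search locally,
    then send the prompt plus retrieved context for completion."""
    question_embedding = encode(question)               # AI gateway proxy call
    matches = local_search(question_embedding, top_n)   # on-prem pgvector query
    prompt = question + "\n\nContext:\n" + "\n".join(matches)
    return complete(prompt)                             # AI gateway proxy call

answer = answer_question(
    "How do I configure runners?",
    encode=lambda q: [0.0, 1.0],
    local_search=lambda emb, n: ["Runners are configured in ..."],
    complete=lambda prompt: "stub answer",
)
```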
#### Approach 2: GitLab-managed database + rich AI gateway chat API

In this approach, we push more functionality into the GitLab infrastructure by simplifying the protocol between GitLab and the AI gateway. Here, the GitLab application only sends the original question to the AI gateway; the AI gateway then executes the internal protocol, including querying an embeddings database. It is still TBD how this database would be maintained for self-managed, since it would require timestamped/snapshotted documentation by version. It would also change the AI gateway from a stateless service into a stateful one, because it now talks to connected storage:
```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant GL as GitLab
    participant AI as AI gateway
    participant DB as Embeddings Database
    participant M as AI model
    Note over U,GL: Self-managed
    Note over AI,M: Cloud-managed
    U->>GL: Send question
    GL->>AI: Send question
    AI->>M: Request question embedding
    M-->>AI: Question embedding
    AI->>DB: Proximity search with question embedding
    DB-->>AI: N closest text matches
    AI->>M: Send prompt with text matches
    M-->>AI: Final answer
    AI-->>GL: Final answer
    GL-->>U: Final answer
```
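Approach 2 seen from the AI gateway: the instance sends only the question, and the gateway performs retrieval against its attached embeddings store, which is what makes the service stateful. Again an illustrative sketch with stubbed dependencies, not the gateway's real API:

```python
# Approach 2: the AI gateway owns the full protocol. `model` and
# `store` stand in for the AI model client and the gateway-attached
# embeddings database, which is what makes the service stateful.

class ChatService:
    def __init__(self, model, store):
        self.model = model   # AI model client stub
        self.store = store   # gateway-attached embeddings DB stub

    def chat(self, question, top_n=5):
        embedding = self.model.embed(question)
        matches = self.store.search(embedding, top_n)
        prompt = question + "\n\nContext:\n" + "\n".join(matches)
        return self.model.complete(prompt)

class StubModel:
    def embed(self, text):
        return [1.0]
    def complete(self, prompt):
        return "answer based on context"

class StubStore:
    def search(self, embedding, top_n):
        return ["docs excerpt"]

reply = ChatService(StubModel(), StubStore()).chat("What is Duo Chat?")
```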