AI Gateway as the Sole Access Point for Monolith to Access Models
## Background
The AI Gateway, a standalone service, is designed to be the central conduit for all communication between GitLab installations and third-party AI models. With future expansions including telemetry, embeddings API, and multi-region/customer-specific deployments, our goal is to provide a scalable, comprehensive AI solution for all GitLab users, regardless of their installation type.
## Goal
- Enable AI features on self-managed (SM) instances
- Enable telemetry
- Make it possible to swap models
- Support Duo Enterprise
- Avoid, where possible, a central API that has to support a huge array of client versions
## Why is this important
* As a strategy, we want to keep third-party models at arm's length from our product and features
* This makes us as model-agnostic as possible
* Enable the gateway to support multiple use cases
* Given the usage, cost, and nature of AI, it is important to have a layer that centralizes AI-related aspects:
* Telemetry for usage and cost monitoring
* Manage scalability
* The ability to have a backup in the event of a provider outage
## Related work
* [Expand the maintainers of the AI Gateway](https://gitlab.com/gitlab-org/ai-powered/ai-framework/team-hq/-/issues/26 "Extend AI Gateway Maintainers")
* As of today, the AI Gateway is a small project with fewer than 10 internal contributors. We are aiming to increase the number of maintainers to ensure the sustainability and growth of the project.
* Telemetry service - To effectively track usage and costs, we plan to centralize telemetry within the AI Gateway, establishing it as the single source of truth (SSoT) for our metrics.
* We may also consider deploying a separate telemetry service in [Runway](https://about.gitlab.com/direction/saas-platforms/scalability/runway/) that communicates with the AI Gateway.
* [Evaluate experiments](https://gitlab.com/gitlab-com/content-sites/internal-handbook/-/merge_requests/4101)
* As we develop AI-powered features, we leverage three layers of testing to assess their effectiveness: first, the quality of the feature's end-to-end flow through SET Quality Testing; second, a comprehensive evaluation framework for the AI-generated content via the [Centralized Evaluation Framework](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation); and third, [diagnostic tools](https://gitlab.com/gitlab-org/ai-powered/ai-framework/qa-evaluation) that utilize a subset of the evaluation framework for quick iteration and diagnosis, known as Diagnostic Testing.
* The experimentation layer lives in https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments/-/issues, where the ~"group::ai model validation" team supplies ongoing guidance within a framework for evaluating changes to our AI features.
* [Gateway into GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/issues/2025 "Include AI Gateway in GDK")
* This will make it easier for frontend engineers and non-technical people to run Duo Chat and Code Suggestions locally. Alongside this work, we also want to consider efforts such as https://gitlab.com/gitlab-org/gitlab/-/issues/439326+s to ensure a smoother developer experience.
* Embeddings API
* We're considering hosting the process of creating and searching embeddings within the AI Gateway. This approach would align with our logic for language models, providing the flexibility to change the model and store without impacting the features that use them.
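
The decoupling described for the Embeddings API can be sketched as follows. This is a minimal illustration, not the AI Gateway's design: `EmbeddingModel`, `VectorStore`, `EmbeddingsService`, and the toy `LetterFrequencyModel` and `InMemoryStore` implementations are all hypothetical names introduced here. The point is only that features talk to one service, so the model or the store could be replaced without touching them.

```python
import math
from dataclasses import dataclass, field
from typing import Protocol


class EmbeddingModel(Protocol):
    """Hypothetical interface: anything that turns text into a vector."""

    def embed(self, text: str) -> list[float]: ...


class VectorStore(Protocol):
    """Hypothetical interface: anything that stores and searches vectors."""

    def add(self, key: str, vector: list[float]) -> None: ...

    def search(self, vector: list[float], top_k: int) -> list[str]: ...


class LetterFrequencyModel:
    """Toy stand-in for a hosted embedding model (assumption, for illustration)."""

    def embed(self, text: str) -> list[float]:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        return counts


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


@dataclass
class InMemoryStore:
    """Toy stand-in for a real vector store (assumption, for illustration)."""

    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, key: str, vector: list[float]) -> None:
        self.vectors[key] = vector

    def search(self, vector: list[float], top_k: int) -> list[str]:
        ranked = sorted(
            self.vectors,
            key=lambda k: cosine(vector, self.vectors[k]),
            reverse=True,
        )
        return ranked[:top_k]


@dataclass
class EmbeddingsService:
    """Features depend on this service only, never on a concrete model or store."""

    model: EmbeddingModel
    store: VectorStore

    def index(self, key: str, text: str) -> None:
        self.store.add(key, self.model.embed(text))

    def query(self, text: str, top_k: int = 1) -> list[str]:
        return self.store.search(self.model.embed(text), top_k)
```

Swapping `LetterFrequencyModel` or `InMemoryStore` for another implementation of the same interface would leave callers of `EmbeddingsService` unchanged.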
## Current Feature Outline
| Feature | Request Path | Third Party Model | Model Regions |
|-----------------------------------|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|-----------------------------|
| Duo Chat | (GitLab Rails instance) => cloud.gitlab.com LB => GCP Global App LB => Cloud Run / AI gateway => LLM | Anthropic Claude-3 / Vertex AI Codey textembedding-gecko | US, TBC / US, EU, APAC, TBC |
| Code suggestions | (GitLab Rails instance) => cloud.gitlab.com LB => GCP Global App LB => Cloud Run / AI gateway => LLM | Code completion: Vertex AI Codey code-gecko / Code generation: Anthropic Claude-2 | US, EU, APAC, TBC / US, TBC |
| Git Suggestions | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey codechat-bison | US, EU, APAC, TBC |
| Discussion Summary | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey text-bison | US, EU, APAC, TBC |
| Issue Description Generation | (GitLab Rails instance) => GCP Global App LB => LLM | Anthropic Claude-2 | US, TBC |
| Test generation | (GitLab Rails instance) => cloud.gitlab.com LB => GCP Global App LB => Cloud Run / AI gateway => LLM | Anthropic Claude-3 | US, TBC |
| Merge request template population | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey text-bison | US, EU, APAC, TBC |
| Suggested Reviewers | (GitLab Rails instance) => GCP Global App LB => Custom LLM | Custom - N/A | US, EU, APAC, TBC |
| Merge request summary | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey text-bison | US, EU, APAC, TBC |
| Code review summary | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey text-bison | US, EU, APAC, TBC |
| Vulnerability explanation | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey text-bison / Anthropic Claude-2 if performance is degraded | US, EU, APAC, TBC / US, TBC |
| Vulnerability resolution | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey code-bison | US, EU, APAC, TBC |
| Code explanation | (GitLab Rails instance) => cloud.gitlab.com LB => GCP Global App LB => Cloud Run / AI gateway => LLM | Vertex AI Codey codechat-bison | US, EU, APAC, TBC |
| Root cause analysis | (GitLab Rails instance) => GCP Global App LB => LLM | Vertex AI Codey text-bison | US, EU, APAC, TBC |
| Value stream forecasting | (GitLab Rails instance) => GCP Global App LB => Custom LLM | Custom - N/A | US, EU, APAC, TBC |
## Plan
#### Feature migration
GitLab currently hosts [16 AI powered features](https://handbook.gitlab.com/handbook/engineering/development/data-science/#features), varying in scope and maturity. The migration of these features to the Gateway is partially dependent on the decision and priority to make them generally available (GA). The AI Framework team will be the Directly Responsible Individual (DRI) to oversee this centralization process.
Features that are connected to the Gateway:
- Code suggestions
- X-Ray
- Duo Chat (excluding documentation-related questions)
The migration of the remaining features to the Gateway is, to some extent, tied to the decision and priority to make them GA, and should follow the considerations below. AI Framework will be **the DRI to centralize the process as a whole**.
* Features that should be included in Chat and, as a by-product, will be part of the Gateway
* DRI: Duo Chat team
* Support: AI Framework
* Features that should be GA but not as part of Chat: if they already exist, they will be moved; if they are yet to be built, they will be built with a "Gateway first" approach
* DRI: relevant group for the feature domain
* Support: AI Framework will provide support to facilitate the move (including engineering, product, and UX help)
* Features that we will recommend deprecating due to metrics, team recommendation, or strategic guidance
* DRI: relevant group for the feature domain
* Make the feature available to SM customers
* DRI: Cloud Connector
This high-level plan and the associated projects are projected to span a period of 6 months. This includes support for teams who will GA their features either as part of Duo Chat or as a standalone feature within the Duo Add-on.
#### Phase 1A - Rapid migration of features to AI GW
* in AI GW: creating transparent proxy endpoints for Anthropic and Vertex
* in Rails Monolith (later on called RM): refactoring Anthropic and Vertex clients to call AI GW instead of third-party AI providers directly
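
As a rough illustration of what a transparent proxy endpoint does, the sketch below maps a gateway path to an upstream provider URL and leaves the rest of the path untouched. The `/v1/proxy/` prefix, the provider keys, and the `.example` hostnames are assumptions made for this example, not the AI Gateway's actual routes or configuration.

```python
# Hypothetical upstream base URLs; real values would come from provider configuration.
UPSTREAMS = {
    "anthropic": "https://anthropic.example/",
    "vertex-ai": "https://vertex.example/",
}

# Hypothetical path prefix under which the gateway exposes its proxy endpoints.
PROXY_PREFIX = "/v1/proxy/"


def resolve_upstream(gateway_path: str) -> str:
    """Translate a gateway proxy path into the provider URL to forward to.

    The provider-specific part of the path is passed through unchanged,
    which is what makes the proxy "transparent" to the calling client.
    """
    if not gateway_path.startswith(PROXY_PREFIX):
        raise ValueError(f"not a proxy path: {gateway_path}")
    provider, _, rest = gateway_path[len(PROXY_PREFIX):].partition("/")
    if provider not in UPSTREAMS:
        raise ValueError(f"unknown provider: {provider}")
    return UPSTREAMS[provider] + rest
```

Under this assumption, the refactoring on the RM side amounts to pointing the existing Anthropic and Vertex clients at the gateway's base URL while keeping their request bodies as they are.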
#### Phase 1B - Enabling features for SM instances
* in RM: working on the authorization part of features (policies, checks) so that they are compatible with self-managed instances
* in C.Dot: issuing tokens for features that are being moved
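
One way to picture how token issuance and authorization checks fit together is sketched below. It assumes a token that carries a set of feature scopes, and that tokens are only scoped to features already moved to the AI Gateway. `InstanceToken`, `issue_token`, `is_authorized`, and the feature identifiers are hypothetical, and a real token would be signed and verified rather than passed around as a plain object.

```python
from dataclasses import dataclass

# Hypothetical identifiers for features already served through the AI Gateway.
MIGRATED_FEATURES = frozenset({"code_suggestions", "duo_chat"})


@dataclass(frozen=True)
class InstanceToken:
    """Hypothetical token payload for a self-managed instance."""

    instance_id: str
    scopes: frozenset


def issue_token(instance_id: str, entitled_features: set) -> InstanceToken:
    """Issue a token scoped to entitled features that have been migrated."""
    return InstanceToken(instance_id, frozenset(entitled_features) & MIGRATED_FEATURES)


def is_authorized(token: InstanceToken, feature: str) -> bool:
    """Check that the requested feature is covered by the token's scopes."""
    return feature in token.scopes
```

In this sketch, a feature that has not been migrated yet never appears in a token's scopes, so the same check keeps working as more features are moved.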
#### Phase 2 - Developing a more robust AI GW
* in AI GW: creating one endpoint per feature, which also covers the prompt library/registry part (covered in another MR) \[ @shekharpatnaik's proposal\]
* in RM: switching features to use one unified AI GW client instead of the Anthropic/Vertex clients
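
The per-feature endpoint idea can be sketched as a registry that owns each feature's prompt and model choice, so the caller sends only a feature name and inputs. This is an assumption-laden illustration: `PromptDefinition`, `PROMPT_REGISTRY`, `handle_feature_request`, the feature keys, and the model names are all hypothetical and do not describe the proposal referenced above.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptDefinition:
    """Hypothetical registry entry: which model to use and how to build the prompt."""

    model: str
    template: Template


# Hypothetical registry; feature keys, model names, and prompts are placeholders.
PROMPT_REGISTRY = {
    "explain_code": PromptDefinition(
        model="model-a",
        template=Template("Explain the following code:\n$code"),
    ),
    "summarize_discussion": PromptDefinition(
        model="model-b",
        template=Template("Summarize this discussion:\n$notes"),
    ),
}


def handle_feature_request(feature: str, inputs: dict) -> dict:
    """Build the provider request for a feature from its registered prompt.

    The caller never sees the prompt or the model name, so both can change
    inside the gateway without a client release.
    """
    definition = PROMPT_REGISTRY.get(feature)
    if definition is None:
        raise KeyError(f"unknown feature: {feature}")
    return {
        "model": definition.model,
        "prompt": definition.template.substitute(inputs),
    }
```

With this shape, a unified client in RM would only need to know the feature name and its inputs, for example `handle_feature_request("explain_code", {"code": "puts 1"})`.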
## Required Work
| Issue | Component | Group | Workflow | Milestone |
|--|--|--|--|--|
| https://gitlab.com/gitlab-org/gitlab/-/issues/454543+s | GitLab Rails / AIGW | ~"group::ai framework" | ~"workflow::in dev" | %"17.1" |
| https://gitlab.com/gitlab-org/gitlab/-/issues/458207+ | GitLab Rails / AIGW | ~"group::ai framework" | ~"workflow::in dev" | %"17.1" |
| https://gitlab.com/gitlab-org/gitlab/-/issues/460473+ | GitLab Rails | ~"group::ai framework" | ~"workflow::in dev" | %"17.1" |
| https://gitlab.com/gitlab-org/gitlab/-/issues/456103+ | GitLab Rails | ~"group::ai framework" | ~"workflow::refinement" | %"17.1" |