Spin up a new web service to serve the endpoints that external clients will talk to
As part of the adjusted proposal for consolidating the way we consume AI models, discussed in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#161 (closed), we intend to have all external clients (for example VSCode or an LSP server implementation) talk to the GitLab-Rails monolith.
These requests can be long-running and will often depend on external services (Vertex, Anthropic), so a single request could occupy a Puma worker for a long time, potentially starving other requests. To get around this, we should spin up a new fleet to serve these endpoints, similar to the fleets we already have (web, api, git, websockets, internal-api), and configure our frontend (HAProxy) to route traffic for the specific AI endpoints to this new fleet.
Besides limiting the blast radius of this workload, a dedicated fleet would also allow us to experiment with the Puma configuration to support higher concurrency, for example as sketched below.
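A minimal sketch of what that could look like in the fleet's `config/puma.rb`, assuming these requests are IO-bound (mostly waiting on Vertex/Anthropic); the worker and thread counts are illustrative, not tuned values:

```ruby
# Hypothetical Puma settings for the AI fleet only.
workers ENV.fetch("PUMA_WORKERS", 4).to_i

# More threads per worker than the regular web fleet, since requests
# spend most of their time waiting on external AI providers.
threads 1, ENV.fetch("PUMA_MAX_THREADS", 16).to_i

# Give long-running AI requests time to finish before a worker is recycled.
worker_shutdown_timeout 60
```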