Assess performance impact of /code_suggestions/completions
Problem
With the proposal in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#161 (closed), we will eventually stop requesting code suggestions from the model gateway directly and instead route these requests through gitlab-rails. For this purpose, a new endpoint was added in !125401 (merged): /code_suggestions/completions.
This will have an impact on the performance of code suggestions and availability of GitLab:
- Additional latency. Client requests now traverse Workhorse and Rails first; Rails then opens a new TCP connection to the model gateway and forwards the request verbatim (for now) before serving the answer, so we spend more time overall serving code suggestions.
- Availability risk. If the model gateway is down or its performance is degraded, the calling Puma thread currently blocks for up to 5 seconds waiting for a response. During this time, the worker thread cannot service any other web clients, which impacts its ability to serve other, unrelated features.
- Impact on self-managed. While the previous point can be side-stepped for SaaS by standing up a dedicated codesuggestions service to which traffic to this endpoint is routed, we cannot do this as easily for self-managed customers.
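The availability risk above can be sketched as a minimal simulation (this is illustrative, not GitLab's actual code; `call_model_gateway` and the latency values are hypothetical, and `Timeout` stands in for whatever HTTP timeout mechanism is used):

```ruby
require "timeout"

# Simulates a synchronous request to the model gateway. `latency` models a
# degraded gateway; in production the timeout is currently 5 seconds.
def call_model_gateway(latency:, timeout: 5)
  Timeout.timeout(timeout) do
    sleep(latency) # the Puma worker thread is blocked for this entire time
    "suggestion"
  end
rescue Timeout::Error
  # Even on timeout, the thread was tied up for the full `timeout` duration
  # and could not serve any other request meanwhile.
  nil
end

call_model_gateway(latency: 0.05)              # => "suggestion"
call_model_gateway(latency: 10, timeout: 0.2)  # => nil, after 0.2s blocked
```

The key point is that the cost of an outage is paid per request: every completion request during a gateway incident holds a worker thread for the full timeout.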
The goal of this issue is to collect insight into the current behavior and failure modes of this endpoint and how we can mitigate associated risk.
Outcome
We've conducted performance analysis of this endpoint to see how it behaves under load:
- Additional latency: Extra time spent in Rails was 35ms at the 50th percentile and 58ms at the 95th percentile (an overhead of 6.9% and 11.4% respectively, compared to going to the model gateway directly). Most of the additional time is spent looking up the user via an access token / PAT and verifying permissions. Given that the p95 response time of the model gateway is 500ms, this is unlikely to materially impact user experience.
- Availability: As expected, introducing artificial latency in the model gateway (here, 2 seconds) to simulate an outage or degraded performance significantly impacted the server's ability to process other user requests, which is visible as additional queue time.
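As a back-of-the-envelope check, the reported overhead percentages imply a direct-path baseline latency of roughly 500ms, consistent with the stated model gateway p95 (the baselines below are derived from the percentages, not measured):

```ruby
# Reported extra time spent in Rails, and the reported relative overhead.
overhead_ms = { p50: 35.0, p95: 58.0 }
ratio       = { p50: 0.069, p95: 0.114 }

overhead_ms.each_key do |pct|
  implied_direct_ms = overhead_ms[pct] / ratio[pct]
  puts format("%s: implied direct-path latency ~ %.0f ms", pct, implied_direct_ms)
end
```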
Next steps
In order to improve fault tolerance, we should:
- Set stricter timeouts for requests to the model gateway (1-2 seconds)
- Log timeout errors so we understand better how to fence in these values
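A minimal sketch of both steps, assuming a plain `Net::HTTP` client (the wrapper name, the 2-second value, and the log format are illustrative; the real integration lives in gitlab-rails and may use a different HTTP client):

```ruby
require "net/http"
require "logger"
require "json"

# Hypothetical stricter timeout, within the proposed 1-2 second range.
MODEL_GATEWAY_TIMEOUT = 2
LOGGER = Logger.new($stdout)

# Hypothetical wrapper around the model gateway call.
def fetch_completion(uri, payload)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = MODEL_GATEWAY_TIMEOUT # cap time to establish the connection
  http.read_timeout = MODEL_GATEWAY_TIMEOUT # cap time waiting for the response
  http.post(uri.path, payload.to_json, "Content-Type" => "application/json")
rescue Net::OpenTimeout, Net::ReadTimeout => e
  # Logging the error class and endpoint gives us the data needed to
  # fence in the timeout values later.
  LOGGER.warn("model gateway timeout: #{e.class} for #{uri}")
  nil
end
```

Returning `nil` here degrades gracefully: the client simply receives no suggestion, and the worker thread is released after at most `MODEL_GATEWAY_TIMEOUT` seconds instead of 5.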
I created #418893 (closed) to track this.
Edited by Matthias Käppler