Follow-up: Code Suggestions Latency
This issue describes a potential future follow-on to !126327 (merged), which describes the request path for AI requests through the AI Abstraction Layer in the Rails monolith and the AI Gateway.
Code Suggestions Latency
Code Suggestions acceptance rates are highly sensitive to latency. While writing code with an AI assistant, a user will pause only briefly before continuing to type out a block of code manually. As soon as the user presses another key, the existing suggestion is invalidated and a new request must be issued to the code suggestions endpoint. That request, in turn, is also highly sensitive to latency.
In the worst case, with sufficiently high latency, the IDE could issue a string of requests, each of which is then ignored as the user types on without waiting for a response. This adds no value for the user, while still putting load on our services.
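To make the invalidation behaviour concrete, here is a minimal sketch of a client that cancels its in-flight request on every keystroke. It is written in Go purely for illustration (the real IDE extensions are not Go), and the endpoint, debounce interval, and all names are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// suggestionClient keeps at most one in-flight request; every new
// keystroke cancels the previous one, mirroring how a suggestion is
// invalidated as soon as the user keeps typing.
// All names and the endpoint here are illustrative assumptions.
type suggestionClient struct {
	endpoint string
	cancel   context.CancelFunc
}

func (c *suggestionClient) onKeystroke() {
	if c.cancel != nil {
		c.cancel() // invalidate the now-stale in-flight request
	}
	ctx, cancel := context.WithCancel(context.Background())
	c.cancel = cancel

	go func() {
		// Debounce: only fire if the user actually pauses.
		select {
		case <-ctx.Done():
			return // superseded by a newer keystroke
		case <-time.After(150 * time.Millisecond):
		}
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.endpoint, nil)
		if err != nil {
			return
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return // cancelled or failed; show nothing
		}
		defer resp.Body.Close()
		fmt.Println("suggestion ready") // display to the user
	}()
}

func main() {
	c := &suggestionClient{endpoint: "https://gitlab.example.com/code_suggestions"}
	c.onKeystroke() // each keypress calls this; with rapid typing,
	c.onKeystroke() // earlier requests never leave the debounce window
	time.Sleep(time.Second)
}
```

Note that cancellation here is purely client-side: a request that has already been dispatched still consumes capacity on our services even though its response will be discarded, which is the load problem described above.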
By choosing to route code suggestions requests through our Rails monolith, additional latency will undoubtedly be added to each request. Compared to a lightweight stateless service, these latencies will be difficult to reduce, owing to the relatively high complexity and number of layers in the Rails monolith.
For the first iteration of this change, we should pass requests through Rails and collect data on latency versus acceptance rates, to assess what a reasonable latency threshold for these requests is.
This will provide us with data on whether this approach (of routing through Rails) is going to be workable going forward. If Rails is adding too much latency for Code Suggestions requests, we may need to consider iterating on the architecture to improve this.
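As a sketch of the kind of measurement that would support this decision (Go, with an assumed endpoint path; in practice the timing would feed our existing metrics pipeline, e.g. a Prometheus histogram, rather than a log line):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// latencyMiddleware records the end-to-end server-side duration of each
// code suggestions request, so latency can later be correlated with
// acceptance rates. The endpoint path below is an assumption.
func latencyMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		// A structured log line is enough to sketch the idea; a real
		// implementation would observe a histogram metric instead.
		log.Printf("code_suggestions path=%s duration_ms=%d",
			r.URL.Path, time.Since(start).Milliseconds())
	})
}

func main() {
	suggest := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"choices":[]}`)) // placeholder response
	})
	http.Handle("/api/v4/code_suggestions/completions", latencyMiddleware(suggest))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```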
Workaround: short-circuit code suggestions requests from Workhorse
If we were to find that this approach was adding too much latency, one potential iteration would be to offload the request to Workhorse. In this design, authentication is handled by Rails (and then cached), and requests are handled directly by Workhorse and forwarded on to the AI Gateway immediately, without having to queue for a Puma thread. This approach has many existing parallels in our architecture, for example Git HTTP requests, Git LFS and other S3 operations, and some Runner operations. A sketch of the idea follows below.
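A minimal Go sketch of that short-circuit, assuming an in-memory token cache and a hypothetical `authenticateWithRails` helper standing in for the real pre-authorization call to Rails (the gateway address and endpoint path are also assumptions):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
	"time"
)

// authCache remembers tokens that Rails has already authenticated, so
// subsequent requests skip Rails (and the Puma queue) entirely.
type authCache struct {
	mu      sync.Mutex
	entries map[string]time.Time // token -> cache expiry
}

func (c *authCache) valid(token string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	exp, ok := c.entries[token]
	return ok && time.Now().Before(exp)
}

func (c *authCache) put(token string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[token] = time.Now().Add(ttl)
}

// authenticateWithRails is a hypothetical stand-in for the internal
// pre-authorization call Workhorse would make to Rails on a cache miss.
func authenticateWithRails(token string) bool { return token != "" }

func main() {
	gateway, err := url.Parse("http://ai-gateway.internal:5052") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(gateway)
	cache := &authCache{entries: map[string]time.Time{}}

	http.HandleFunc("/api/v4/code_suggestions/completions", func(w http.ResponseWriter, r *http.Request) {
		token := r.Header.Get("Authorization")
		if !cache.valid(token) {
			// Cache miss: authenticate via Rails once, then cache.
			if !authenticateWithRails(token) {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			cache.put(token, 5*time.Minute)
		}
		// Authenticated: forward straight to the AI Gateway without
		// waiting for a Puma thread.
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8181", nil))
}
```

Cache expiry and invalidation would need the same care as in our existing Workhorse offloads (for example, Git HTTP and LFS), since a stale entry would let revoked tokens keep issuing requests until the TTL lapses.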
This would also mean that, for improved latency, any preprocessing might be better done in the client, removing it from the hot path of each individual code suggestion request.
cc @m_gill