Serve completions endpoint through Workhorse

Matthias Käppler requested to merge 418893-codesuggestions-sendurl into master

What does this MR do and why?

In !125401 (merged) we introduced a new endpoint, /code_suggestions/completions, which handles client requests for code completions in Rails and talks to the model gateway via a synchronous HTTP call. This presents a performance and availability problem, since Puma workers block on this call and cannot serve other requests while waiting.

To avoid blocking the calling Puma worker thread while waiting for a response from the AI gateway, this MR offloads the call to Workhorse instead. We can use send-url for this: Rails hands the target URL and the POST body back to Workhorse, which then issues the request to the AI gateway and delivers the response to the caller.
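The handoff can be sketched as follows. This is a minimal illustration, not the actual Workhorse code: the struct fields and function names (sendURLParams, forward) are assumptions, but the shape is the same — Rails encodes the target URL and body as base64 JSON behind a "send-url:" prefix, and Workhorse decodes it and performs the upstream request so the Puma worker is freed immediately.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sendURLParams is a hypothetical stand-in for the JSON payload Rails
// encodes into the Send-Data header for Workhorse.
type sendURLParams struct {
	URL    string `json:"URL"`
	Method string `json:"Method"`
	Body   string `json:"Body"`
}

// forward decodes a "send-url:<base64 JSON>" payload and performs the
// upstream request on behalf of Rails, returning the response body.
func forward(headerValue string) (string, error) {
	raw, err := base64.URLEncoding.DecodeString(strings.TrimPrefix(headerValue, "send-url:"))
	if err != nil {
		return "", err
	}
	var p sendURLParams
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", err
	}
	req, err := http.NewRequest(p.Method, p.URL, bytes.NewBufferString(p.Body))
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	return string(out), err
}

// runDemo stands up a fake AI gateway, builds the payload Rails would
// return, and lets the Workhorse-side code issue the request.
func runDemo() string {
	gw := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		fmt.Fprintf(w, "completion for: %s", body)
	}))
	defer gw.Close()

	payload, _ := json.Marshal(sendURLParams{
		URL:    gw.URL + "/v2/completions",
		Method: "POST",
		Body:   `{"prompt_version":1}`,
	})
	header := "send-url:" + base64.URLEncoding.EncodeToString(payload)

	out, err := forward(header)
	if err != nil {
		panic(err)
	}
	return out
}

func main() {
	fmt.Println(runDemo())
}
```

With this split, the 2-second gateway latency is paid inside Workhorse's goroutine rather than a Puma worker thread, which is what the load tests below verify.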

Screenshots or screen recordings

I verified that this largely solves the availability problem in #418203 (closed) by running the same load test (see that issue for how it was set up).

Request POST /code_suggestions/completions at 5rps:

cat gitlab.http | vegeta attack -format=http -rate=5 -duration=10s | tee results.bin | vegeta report
Requests      [total, rate, throughput]         50, 5.10, 4.22
Duration      [total, attack, wait]             11.846s, 9.8s, 2.047s
Latencies     [min, mean, 50, 90, 95, 99, max]  2.034s, 2.045s, 2.044s, 2.055s, 2.06s, 2.075s, 2.075s
Bytes In      [total, mean]                     9500, 190.00
Bytes Out     [total, mean]                     9100, 182.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:50  
Error Set:

We can see the model gateway's injected 2-second latency here, but gitlab-rails no longer experiences it (Workhorse and the client do).

Request /projects/:id at 1rps simultaneously:

cat project.http | vegeta attack -format=http -rate=1 -duration=10s | vegeta report
Requests      [total, rate, throughput]         10, 1.11, 1.10
Duration      [total, attack, wait]             9.132s, 9s, 131.892ms
Latencies     [min, mean, 50, 90, 95, 99, max]  124.671ms, 134.005ms, 131.959ms, 146.796ms, 147.441ms, 147.441ms, 147.441ms
Bytes In      [total, mean]                     44270, 4427.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:10  
Error Set:

These requests were sent in parallel with the completions requests. We can see that they no longer suffer additional latency; all returned within the usual time the system takes when not under load.

How to set up and validate locally

  • Run both gitlab-rails and the model gateway locally (set CODE_SUGGESTIONS_BASE_URL to point to the local MG instance)
  • Send a completions request to Rails:
    curl -v -H'Authorization: Bearer <snip>' -H'X-gitlab-oidc-token: <snip>' -H'content-type: application/json' -d'{
      "prompt_version": 1,
      "current_file": {
        "file_name": "test.py",
        "content_above_cursor": "def is_even(n: int) ->",
        "content_below_cursor": ""
      }
    }' localhost:3000/api/v4/code_suggestions/completions
  • It should respond with 200 OK, and in the Workhorse logs you should see:
    time="2023-07-19T08:59:08Z" level=info msg="SendURL: sending" correlation_id=01H5PNZNX728KBWG5X8ZGM0NKB path=/api/v4/code_suggestions/completions url="http://ai-gateway:5000/v2/completions"
    localhost:3000 192.168.80.1 - - [2023/07/19:08:59:10 +0000] "POST /api/v4/code_suggestions/completions HTTP/1.1" 200 190 "" "curl/7.85.0" 5710

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #418893 (closed)
