Reduce latency by switching to async Vertex client and single tokenizer instance
The existing prod version uses `run_in_threadpool` to launch the full business logic and API calls. This MR eliminates the need for `run_in_threadpool` by replacing the sync Vertex client with its async version.
According to the benchmarks below, this change, together with using a single shared tokenizer instance, reduces mean latency roughly fivefold (1957.6 ms → 403.0 ms at concurrency 20).
Recall that `run_in_threadpool` spawns a worker thread (GIL contention, a limited pool size, context switching), whereas the async client returns a coroutine object.
Docs about running sync functions in a worker thread: https://anyio.readthedocs.io/en/stable/threads.html#working-with-threads. Under the hood, `run_in_threadpool` delegates the call to an AnyIO worker thread; a sketch of the before/after shape follows.
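A minimal sketch of the difference, assuming the GAPIC Vertex clients (`PredictionServiceClient` / `PredictionServiceAsyncClient`). The route names, the `payload: dict` body, and the `gpt2` tokenizer are illustrative placeholders, not the gateway's actual code:

```python
from fastapi import FastAPI
from starlette.concurrency import run_in_threadpool
from google.cloud import aiplatform_v1
from transformers import AutoTokenizer

app = FastAPI()

# One tokenizer instance, created at startup and shared by every request,
# instead of constructing a new one per request (the "single tokenizer
# instance" part of this MR; the model name is a placeholder).
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")

sync_client = aiplatform_v1.PredictionServiceClient()        # current prod
async_client = aiplatform_v1.PredictionServiceAsyncClient()  # this MR

@app.post("/v2/completions-before")  # illustrative route
async def completions_before(payload: dict):
    # The blocking predict() call is handed to a worker thread so the event
    # loop stays free, but every request costs a thread: GIL contention,
    # a bounded pool, and a context switch each way.
    return await run_in_threadpool(sync_client.predict, request=payload)

@app.post("/v2/completions-after")  # illustrative route
async def completions_after(payload: dict):
    # The async client's predict() returns a coroutine that is awaited
    # directly on the event loop; no worker thread is involved.
    return await async_client.predict(request=payload)
```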
## Benchmarks

Staging environment:

- 3 model-gateway pods
- no horizontal autoscaling
### Main branch
```
Benchmarking model-k8s-gateway (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname:        model-k8s-gateway
Server Port:            8080

Document Path:          /v2/completions
Document Length:        254 bytes

Concurrency Level:      20
Time taken for tests:   48.940 seconds
Complete requests:      500
Failed requests:        193
   (Connect: 0, Receive: 0, Length: 193, Exceptions: 0)
Total transferred:      228121 bytes
Total body sent:        240500
HTML transferred:       123260 bytes
Requests per second:    10.22 [#/sec] (mean)
Time per request:       1957.600 [ms] (mean)
Time per request:       97.880 [ms] (mean, across all concurrent requests)
Transfer rate:          4.55 [Kbytes/sec] received
                        4.80 kb/s sent
                        9.35 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   4.4      0      34
Processing:   506 1922 997.0   1694    5099
Waiting:      484 1756 860.5   1562    4704
Total:        506 1923 997.0   1694    5099

Percentage of the requests served within a certain time (ms)
  50%   1694
  66%   2255
  75%   2520
  80%   2710
  90%   3463
  95%   3973
  98%   4328
  99%   4580
 100%   5099 (longest request)
```
### Current change
```
Benchmarking model-k8s-gateway (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname:        model-k8s-gateway
Server Port:            8080

Document Path:          /v2/completions
Document Length:        254 bytes

Concurrency Level:      20
Time taken for tests:   10.075 seconds
Complete requests:      500
Failed requests:        216
   (Connect: 0, Receive: 0, Length: 216, Exceptions: 0)
Total transferred:      227626 bytes
Total body sent:        240500
HTML transferred:       122463 bytes
Requests per second:    49.63 [#/sec] (mean)
Time per request:       403.000 [ms] (mean)
Time per request:       20.150 [ms] (mean, across all concurrent requests)
Transfer rate:          22.06 [Kbytes/sec] received
                        23.31 kb/s sent
                        45.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.6      0      28
Processing:   209  392 108.6    382     970
Waiting:      208  391 108.5    380     970
Total:        209  392 108.8    382     971

Percentage of the requests served within a certain time (ms)
  50%    382
  66%    403
  75%    436
  80%    445
  90%    466
  95%    502
  98%    820
  99%    897
 100%    971 (longest request)
```
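Summary of the two runs above (same load: 500 requests, concurrency 20):

| Metric | Main branch | Current change |
| --- | --- | --- |
| Requests per second (mean) | 10.22 | 49.63 |
| Time per request (mean) | 1957.6 ms | 403.0 ms |
| p95 latency | 3973 ms | 502 ms |

Note on `Failed requests`: in both runs the failures are all `Length` checks. `ab` marks a response as failed when its body length differs from the first response it received, which is expected here since completion responses vary in length.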
## How to run the benchmarks
- For testing purposes, there is an `ab` pod deployed in the staging cluster:

  ```shell
  kubectl get pods -n fauxpilot | grep "ab"
  ```

- Log into the pod:

  ```shell
  kubectl exec -it ab -n fauxpilot -- sh
  ```

- Create a `prompt.txt` file with the following content:

  ```json
  {
    "prompt_version": 1,
    "project_path": "gitlab-org/modelops/applied-ml/review-recommender/pipeline-scheduler",
    "project_id": 33191677,
    "current_file": {
      "file_name": "test.py",
      "content_above_cursor": "def hello_world",
      "content_below_cursor": ""
    }
  }
  ```

- Launch the benchmark:

  ```shell
  ab -n 500 -c 20 -H "Authorization: Bearer PAT" -T 'application/json' -p prompt.txt http://model-k8s-gateway:8080/v2/completions
  ```
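For reference: `-n` is the total number of requests, `-c` the concurrency level, `-p` and `-T` the POST body file and its content type, and `-H` adds the authorization header (replace `PAT` with a valid personal access token).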