Reduce latency by switching to async Vertex client and single tokenizer instance
The existing prod version uses `run_in_threadpool` to launch the full business logic and API calls. This MR eliminates the need for `run_in_threadpool` by replacing the sync Vertex client with its async version.
According to the benchmarks below, this change, together with using a single shared tokenizer instance, reduces mean latency roughly fivefold (1957.6 ms → 403.0 ms at concurrency 20).
Recall that `run_in_threadpool` spawns a worker thread (GIL contention, a limited pool size, context switching), whereas the async client returns a coroutine object.
Docs about running sync functions in a worker thread: https://anyio.readthedocs.io/en/stable/threads.html#working-with-threads. Under the hood, `run_in_threadpool` delegates the call to an AnyIO worker thread; a sketch of the before/after shape follows.
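A minimal sketch of the difference, assuming the GAPIC Vertex clients (`PredictionServiceClient` / `PredictionServiceAsyncClient`). The route names, the `payload: dict` body, and the `gpt2` tokenizer are illustrative placeholders, not the gateway's actual code:

```python
from fastapi import FastAPI
from starlette.concurrency import run_in_threadpool
from google.cloud import aiplatform_v1
from transformers import AutoTokenizer

app = FastAPI()

# One tokenizer instance, created at startup and shared by every request,
# instead of constructing a new one per request (the "single tokenizer
# instance" part of this MR; the model name is a placeholder).
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")

sync_client = aiplatform_v1.PredictionServiceClient()        # current prod
async_client = aiplatform_v1.PredictionServiceAsyncClient()  # this MR

@app.post("/v2/completions-before")  # illustrative route
async def completions_before(payload: dict):
    # The blocking predict() call is handed to a worker thread so the event
    # loop stays free, but every request costs a thread: GIL contention,
    # a bounded pool, and a context switch each way.
    return await run_in_threadpool(sync_client.predict, request=payload)

@app.post("/v2/completions-after")  # illustrative route
async def completions_after(payload: dict):
    # The async client's predict() returns a coroutine that is awaited
    # directly on the event loop; no worker thread is involved.
    return await async_client.predict(request=payload)
```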
## Benchmarks

Staging environment:

- 3 model-gateway pods
- no horizontal autoscaling
### Main branch
```
Benchmarking model-k8s-gateway (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname:        model-k8s-gateway
Server Port:            8080

Document Path:          /v2/completions
Document Length:        254 bytes

Concurrency Level:      20
Time taken for tests:   48.940 seconds
Complete requests:      500
Failed requests:        193
   (Connect: 0, Receive: 0, Length: 193, Exceptions: 0)
Total transferred:      228121 bytes
Total body sent:        240500
HTML transferred:       123260 bytes
Requests per second:    10.22 [#/sec] (mean)
Time per request:       1957.600 [ms] (mean)
Time per request:       97.880 [ms] (mean, across all concurrent requests)
Transfer rate:          4.55 [Kbytes/sec] received
                        4.80 kb/s sent
                        9.35 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   4.4      0      34
Processing:   506 1922 997.0   1694    5099
Waiting:      484 1756 860.5   1562    4704
Total:        506 1923 997.0   1694    5099

Percentage of the requests served within a certain time (ms)
  50%   1694
  66%   2255
  75%   2520
  80%   2710
  90%   3463
  95%   3973
  98%   4328
  99%   4580
 100%   5099 (longest request)
```
### Current change
```
Benchmarking model-k8s-gateway (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname:        model-k8s-gateway
Server Port:            8080

Document Path:          /v2/completions
Document Length:        254 bytes

Concurrency Level:      20
Time taken for tests:   10.075 seconds
Complete requests:      500
Failed requests:        216
   (Connect: 0, Receive: 0, Length: 216, Exceptions: 0)
Total transferred:      227626 bytes
Total body sent:        240500
HTML transferred:       122463 bytes
Requests per second:    49.63 [#/sec] (mean)
Time per request:       403.000 [ms] (mean)
Time per request:       20.150 [ms] (mean, across all concurrent requests)
Transfer rate:          22.06 [Kbytes/sec] received
                        23.31 kb/s sent
                        45.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.6      0      28
Processing:   209  392 108.6    382     970
Waiting:      208  391 108.5    380     970
Total:        209  392 108.8    382     971

Percentage of the requests served within a certain time (ms)
  50%    382
  66%    403
  75%    436
  80%    445
  90%    466
  95%    502
  98%    820
  99%    897
 100%    971 (longest request)
```
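Summary of the two runs above (same load: 500 requests, concurrency 20):

| Metric | Main branch | Current change |
| --- | --- | --- |
| Requests per second (mean) | 10.22 | 49.63 |
| Time per request (mean) | 1957.6 ms | 403.0 ms |
| p95 latency | 3973 ms | 502 ms |

Note on `Failed requests`: in both runs the failures are all `Length` checks. `ab` marks a response as failed when its body length differs from the first response it received, which is expected here since completion responses vary in length.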
## How to run the benchmarks
- For testing purposes, there is an `ab` pod deployed in the staging cluster:

  ```shell
  kubectl get pods -n fauxpilot | grep "ab"
  ```

- Log into the pod:

  ```shell
  kubectl exec -it ab -n fauxpilot -- sh
  ```

- Create a `prompt.txt` file with the following content:

  ```json
  {
    "prompt_version": 1,
    "project_path": "gitlab-org/modelops/applied-ml/review-recommender/pipeline-scheduler",
    "project_id": 33191677,
    "current_file": {
      "file_name": "test.py",
      "content_above_cursor": "def hello_world",
      "content_below_cursor": ""
    }
  }
  ```

- Launch the benchmark:

  ```shell
  ab -n 500 -c 20 -H "Authorization: Bearer PAT" -T 'application/json' -p prompt.txt http://model-k8s-gateway:8080/v2/completions
  ```
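For reference: `-n` is the total number of requests, `-c` the concurrency level, `-p` and `-T` the POST body file and its content type, and `-H` adds the authorization header (replace `PAT` with a valid personal access token).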