Pad input tensors to enable dynamic batching

This MR changes the preprocessing step to pad each prompt up to the length of the longest prompt in the batch, so input tensors have a uniform shape and can be batched dynamically.
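The padding step can be sketched roughly as below. This is a minimal illustration, not the MR's actual implementation: the function name, the use of NumPy, and the `pad_id` parameter are assumptions.

```python
import numpy as np

def pad_batch(prompts, pad_id=0):
    """Right-pad token-ID prompts to the longest prompt in the batch.

    Returns the padded 2-D batch plus the original lengths, which the
    downstream model needs in order to ignore the padding tokens.
    NOTE: pad_id and the function name are illustrative assumptions.
    """
    max_len = max(len(p) for p in prompts)
    lengths = np.array([len(p) for p in prompts], dtype=np.int32)
    padded = np.full((len(prompts), max_len), pad_id, dtype=np.int32)
    for i, p in enumerate(prompts):
        padded[i, : len(p)] = p
    return padded, lengths
```

Because every row now has the same length, the per-request tensors can be stacked into one batch for the model.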

Settings used to configure dynamic batching

  • 2-GPU model (current prod setting)
  • dynamic_batching enabled
    • max_batch_size: 128
    • max_queue_delay_microseconds: 350000 in the preprocessing step.
      The FT model is slower than the other components, so a small preprocessing
      delay lets us fit up to 128 prompts into a batch and increase throughput.
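The settings above correspond roughly to a Triton `config.pbtxt` fragment like the following. The model name is illustrative; only `max_batch_size` and `max_queue_delay_microseconds` come from this MR.

```
name: "preprocessing"  # illustrative name, not from this MR
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 350000
}
```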

Benchmark

log.out

Summary
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 1.33314 infer/sec, latency 831938 usec
Concurrency: 11, throughput: 10.8872 infer/sec, latency 1036064 usec
Concurrency: 21, throughput: 19.8302 infer/sec, latency 1086208 usec
Concurrency: 31, throughput: 28.0506 infer/sec, latency 1110993 usec
Concurrency: 41, throughput: 35.2168 infer/sec, latency 1144217 usec
Concurrency: 51, throughput: 43.7126 infer/sec, latency 1161575 usec
Concurrency: 61, throughput: 51.8798 infer/sec, latency 1169022 usec
Concurrency: 71, throughput: 58.5983 infer/sec, latency 1202156 usec
Concurrency: 81, throughput: 66.9279 infer/sec, latency 1242497 usec
Concurrency: 91, throughput: 73.6508 infer/sec, latency 1259617 usec

Content used to benchmark

test_v3.json

Command used to run:

perf_analyzer -m ensemble --percentile=95 --concurrency-range 1:100:10 --input-data test

Closes #127

Edited by Alexander Chueshev
