Pad input tensors to enable dynamic batching
This MR changes the preprocessing step to pad each prompt to the length of the longest prompt in the batch.
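A minimal sketch of the padding idea (the function name `pad_batch` and the pad id are illustrative assumptions, not the actual preprocessing code in this MR):

```python
PAD_ID = 0  # assumed pad token id; the real model may use a different one

def pad_batch(prompts):
    """Pad each tokenized prompt to the length of the longest prompt
    in the batch, and keep the original lengths for masking."""
    max_len = max(len(p) for p in prompts)
    padded = [p + [PAD_ID] * (max_len - len(p)) for p in prompts]
    lengths = [len(p) for p in prompts]  # original (unpadded) lengths
    return padded, lengths
```

With equal-length rows, the batch can be stacked into a single dense tensor, which is what makes dynamic batching possible.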
Settings used to configure dynamic batching:

- 2-GPU model (current prod setting)
- `dynamic_batching` enabled
- `max_batch_size: 128`
- `max_queue_delay_microseconds: 350000` in preprocessing
The FT model is slower than the other components of the pipeline, so we can afford a small preprocessing delay to collect up to 128 prompts into a batch and increase throughput.
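A sketch of how these settings look in the Triton model configuration (only the fields discussed above are shown; the rest of the `config.pbtxt` is omitted):

```
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 350000
}
```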
Benchmark
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 1.33314 infer/sec, latency 831938 usec
Concurrency: 11, throughput: 10.8872 infer/sec, latency 1036064 usec
Concurrency: 21, throughput: 19.8302 infer/sec, latency 1086208 usec
Concurrency: 31, throughput: 28.0506 infer/sec, latency 1110993 usec
Concurrency: 41, throughput: 35.2168 infer/sec, latency 1144217 usec
Concurrency: 51, throughput: 43.7126 infer/sec, latency 1161575 usec
Concurrency: 61, throughput: 51.8798 infer/sec, latency 1169022 usec
Concurrency: 71, throughput: 58.5983 infer/sec, latency 1202156 usec
Concurrency: 81, throughput: 66.9279 infer/sec, latency 1242497 usec
Concurrency: 91, throughput: 73.6508 infer/sec, latency 1259617 usec
Content used to benchmark
Command used to run:

    perf_analyzer -m ensemble --percentile=95 --concurrency-range 1:100:10 --input-data test
Closes #127
Edited by Alexander Chueshev