Pad input tensors to enable dynamic batching

This MR changes the preprocessing step to pad each prompt up to the length of the longest prompt in the batch, so input tensors have a uniform shape and can be batched dynamically.
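The padding step can be sketched roughly as below. This is a minimal illustration, not the MR's actual implementation: the function name, the use of NumPy, and the `pad_id` parameter are assumptions.

```python
import numpy as np

def pad_batch(prompts, pad_id=0):
    """Right-pad token-ID prompts to the longest prompt in the batch.

    Returns the padded 2-D batch plus the original lengths, which the
    downstream model needs in order to ignore the padding tokens.
    NOTE: pad_id and the function name are illustrative assumptions.
    """
    max_len = max(len(p) for p in prompts)
    lengths = np.array([len(p) for p in prompts], dtype=np.int32)
    padded = np.full((len(prompts), max_len), pad_id, dtype=np.int32)
    for i, p in enumerate(prompts):
        padded[i, : len(p)] = p
    return padded, lengths
```

Because every row now has the same length, the per-request tensors can be stacked into one batch for the model.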

Settings used to configure dynamic batching

  • 2-GPU model (current prod setting)
  • dynamic_batching enabled
    • max_batch_size: 128
    • max_queue_delay_microseconds: 350000 in the preprocessing step.
      The FT model is slower than the other components, so a small preprocessing
      delay lets us fit up to 128 prompts into a batch and increase throughput.
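The settings above correspond roughly to a Triton `config.pbtxt` fragment like the following. The model name is illustrative; only `max_batch_size` and `max_queue_delay_microseconds` come from this MR.

```
name: "preprocessing"  # illustrative name, not from this MR
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 350000
}
```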

Benchmark

log.out

Summary
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 1.33314 infer/sec, latency 831938 usec
Concurrency: 11, throughput: 10.8872 infer/sec, latency 1036064 usec
Concurrency: 21, throughput: 19.8302 infer/sec, latency 1086208 usec
Concurrency: 31, throughput: 28.0506 infer/sec, latency 1110993 usec
Concurrency: 41, throughput: 35.2168 infer/sec, latency 1144217 usec
Concurrency: 51, throughput: 43.7126 infer/sec, latency 1161575 usec
Concurrency: 61, throughput: 51.8798 infer/sec, latency 1169022 usec
Concurrency: 71, throughput: 58.5983 infer/sec, latency 1202156 usec
Concurrency: 81, throughput: 66.9279 infer/sec, latency 1242497 usec
Concurrency: 91, throughput: 73.6508 infer/sec, latency 1259617 usec

Content used to benchmark

test_v3.json

Command used to run:

perf_analyzer -m ensemble --percentile=95 --concurrency-range 1:100:10 --input-data test

Closes #127

Edited by Alexander Chueshev
