Pad input tensors to enable dynamic batching

Andras Herczeg requested to merge pad-requests into main

This MR changes the preprocessing step to pad each prompt up to the length of the longest prompt in the batch.
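As a minimal sketch of the padding step (the function name and pad token are assumptions, not taken from this MR), the preprocessing could right-pad each tokenized prompt to the batch maximum and keep the true lengths for downstream use:

```python
import numpy as np

def pad_batch(prompts, pad_id=0):
    """Right-pad each tokenized prompt to the length of the longest
    prompt in the batch. Returns a dense (batch, max_len) int32 array
    plus the original per-prompt lengths."""
    max_len = max(len(p) for p in prompts)
    lengths = np.array([len(p) for p in prompts], dtype=np.int32)
    padded = np.full((len(prompts), max_len), pad_id, dtype=np.int32)
    for i, prompt in enumerate(prompts):
        padded[i, : len(prompt)] = prompt
    return padded, lengths
```

Uniform tensor shapes are what allow Triton to concatenate requests into one batch; the lengths tensor lets the model ignore the padding.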

Settings used to configure dynamic batching

  • 2-GPU model (current prod setting)
  • dynamic_batching enabled
    • max_batch_size: 128
    • max_queue_delay_microseconds: 350000 in preprocessing.
      The FT model is slower than the other components, so we can afford
      a small preprocessing delay to fit up to 128 prompts into a batch and increase throughput.
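The settings above would map onto the preprocessing model's config.pbtxt roughly as follows (a hedged fragment reflecting only the values listed, not the full production config):

```
# Hypothetical config.pbtxt fragment for the preprocessing model.
# max_batch_size caps the batch; max_queue_delay_microseconds lets
# requests queue briefly so larger batches can form.
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 350000
}
```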

Benchmark

log.out

Summary
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 1.33314 infer/sec, latency 831938 usec
Concurrency: 11, throughput: 10.8872 infer/sec, latency 1036064 usec
Concurrency: 21, throughput: 19.8302 infer/sec, latency 1086208 usec
Concurrency: 31, throughput: 28.0506 infer/sec, latency 1110993 usec
Concurrency: 41, throughput: 35.2168 infer/sec, latency 1144217 usec
Concurrency: 51, throughput: 43.7126 infer/sec, latency 1161575 usec
Concurrency: 61, throughput: 51.8798 infer/sec, latency 1169022 usec
Concurrency: 71, throughput: 58.5983 infer/sec, latency 1202156 usec
Concurrency: 81, throughput: 66.9279 infer/sec, latency 1242497 usec
Concurrency: 91, throughput: 73.6508 infer/sec, latency 1259617 usec

Content used to benchmark

test_v3.json

Command used to run:

perf_analyzer -m ensemble --percentile=95 --concurrency-range 1:100:10 --input-data test

Closes #127

Edited by Alexander Chueshev