Add concurrent model execution to Triton

Alexander Chueshev requested to merge multi-gpu-triton-pod into main

What does this MR do and why?

This MR modifies the Helm chart and the K8s model loader job to allow concurrent model execution when the node pool provides multiple GPUs per node.

In the staging cluster (ai-assist-test), we have already changed the GPU node pool so that each node contains 2 GPUs. In the prod cluster, we can allocate up to 16 A100 40GB GPUs per node. This will allow us to serve n replicas of Triton per node (instead of the current 1) with concurrent model execution.
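As a quick check outside this MR, the GPU capacity a node in the pool exposes to the scheduler can be inspected with kubectl (the node name below is a placeholder):

# Shows the nvidia.com/gpu capacity/allocatable counts for a node in the GPU pool
kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'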

Benchmarks

Testing environment - GPU Topology

root@model-triton-777f7f4586-vhtcs:/tmp# nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV12    0-23            N/A
GPU1    NV12     X      0-23            N/A

We run perf_analyzer following @stanhu's approach in #77 (comment 1402837192).
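For reference, a representative invocation consistent with the concurrency sweep below (the model name and endpoint are placeholders; the exact flags follow the linked comment) is:

perf_analyzer -m <model-name> -i grpc -u localhost:8001 \
  --input-data test.json \
  --concurrency-range 1:91:10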

Benchmark 1

  • instance_group = 1
  • pipeline_para_size = 2
  • tensor_para_size = 1 (1-GPU model, not sharded during conversion)
  • dynamic_batching disabled (see the config sketch below)
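These settings map onto the FasterTransformer model's config.pbtxt roughly as follows (a hedged sketch; the exact field placement depends on our model repository layout):

# Benchmark 1: a single execution instance, pipeline-parallel across the 2 GPUs
instance_group [ { count: 1 } ]
parameters { key: "pipeline_para_size" value: { string_value: "2" } }
parameters { key: "tensor_para_size" value: { string_value: "1" } }
# no dynamic_batching block, i.e. dynamic batching stays disabled
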
Results
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 2.4441 infer/sec, latency 409490 usec
Concurrency: 11, throughput: 2.49955 infer/sec, latency 4484831 usec
Concurrency: 21, throughput: 2.49963 infer/sec, latency 8560859 usec
Concurrency: 31, throughput: 2.44411 infer/sec, latency 12635924 usec
Concurrency: 41, throughput: 2.44408 infer/sec, latency 16711496 usec
Concurrency: 51, throughput: 2.44401 infer/sec, latency 17004964 usec
Concurrency: 61, throughput: 2.44399 infer/sec, latency 20786874 usec
Concurrency: 71, throughput: 2.444 infer/sec, latency 24861428 usec
Concurrency: 81, throughput: 2.44405 infer/sec, latency 28932473 usec
Concurrency: 91, throughput: 2.44408 infer/sec, latency 33024001 usec

Benchmark 2

  • instance_group = 2
  • pipeline_para_size = 2
  • tensor_para_size = 1 (1-GPU model, not sharded during conversion)
  • dynamic_batching disabled
Results
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 2.44412 infer/sec, latency 409689 usec
Concurrency: 11, throughput: 2.72178 infer/sec, latency 4321594 usec
Concurrency: 21, throughput: 2.72178 infer/sec, latency 4321594 usec

Benchmark 3

  • instance_group = 1
  • pipeline_para_size = 1
  • tensor_para_size = 2 (2-GPU model, sharded during conversion)
  • dynamic_batching disabled
Results
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 2.05527 infer/sec, latency 481085 usec
Concurrency: 11, throughput: 2.11075 infer/sec, latency 5269870 usec
Concurrency: 21, throughput: 2.11082 infer/sec, latency 10060785 usec
Concurrency: 31, throughput: 2.05526 infer/sec, latency 14850123 usec
Concurrency: 41, throughput: 2.1108 infer/sec, latency 19648539 usec
Concurrency: 51, throughput: 2.05522 infer/sec, latency 19651967 usec
Concurrency: 61, throughput: 2.11073 infer/sec, latency 24426309 usec
Concurrency: 71, throughput: 2.11079 infer/sec, latency 29211689 usec
Concurrency: 81, throughput: 2.11081 infer/sec, latency 34002108 usec
Concurrency: 91, throughput: 2.11082 infer/sec, latency 38790996 usec

Benchmark 4

  • instance_group = 2
  • pipeline_para_size = 1
  • tensor_para_size = 2 (2-GPU model, sharded during conversion)
  • dynamic_batching disabled
Results
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 2.05526 infer/sec, latency 480994 usec
Concurrency: 11, throughput: 2.05526 infer/sec, latency 480994 usec

Benchmark 5

  • instance_group = 1
  • pipeline_para_size = 2
  • tensor_para_size = 1 (1-GPU model, not sharded during conversion)
  • dynamic_batching enabled (see the one-line config change below)
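Relative to Benchmark 1, the only configuration change here is enabling the dynamic batcher; as a sketch, with Triton's defaults this is the single stanza added to config.pbtxt:

# enables Triton's dynamic batcher with default batch sizes and queue delay
dynamic_batching { }
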
Results
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 2.44411 infer/sec, latency 409035 usec
Concurrency: 11, throughput: 10.9983 infer/sec, latency 1029421 usec
Concurrency: 21, throughput: 19.83 infer/sec, latency 1042524 usec
Concurrency: 31, throughput: 29.274 infer/sec, latency 1065358 usec
Concurrency: 41, throughput: 38.1616 infer/sec, latency 1154819 usec
Concurrency: 51, throughput: 38.1616 infer/sec, latency 1154819 usec

Benchmark 6

  • instance_group = 1
  • pipeline_para_size = 1
  • tensor_para_size = 2 (2-GPU model, sharded during conversion)
  • dynamic_batching enabled
Results
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 2.05526 infer/sec, latency 481416 usec
Concurrency: 11, throughput: 10.9984 infer/sec, latency 1017659 usec
Concurrency: 21, throughput: 19.2183 infer/sec, latency 1080090 usec
Concurrency: 31, throughput: 28.1616 infer/sec, latency 1115034 usec
Concurrency: 41, throughput: 35.2704 infer/sec, latency 1146562 usec
Concurrency: 51, throughput: 43.658 infer/sec, latency 1177047 usec
Concurrency: 61, throughput: 50.825 infer/sec, latency 1202761 usec
Concurrency: 71, throughput: 59.1549 infer/sec, latency 1232734 usec
Concurrency: 81, throughput: 64.933 infer/sec, latency 1248621 usec
Concurrency: 91, throughput: 73.5408 infer/sec, latency 1271415 usec

Docs: https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization#concurrent-model-execution

Ref: #126 (closed)

Artifacts

Content of test.json used as input_data for perf_analyzer

{"data": [{"prompt": {"content": ["<python>def is_even(na: int) ->"], "shape": [1]}, "request_output_len": {"content": [32], "shape": [1]}, "temperature": [0.20000000298023224], "repetition_penalty": [1.0], "runtime_top_k": [0], "runtime_top_p": [0.9800000190734863], "start_id": [50256], "end_id": [50256], "random_seed": [1594506369], "is_return_log_probs": [true]}]}