
ML Infra: Scalability: Analysing inference load on Triton with perf_analyzer

To test the Model Gateway and the Triton Server, we should put together some simple scripts that induce load.

What would be a representative load sample? I'm inclined to record some traffic submitted via VSCode and then generate variations from there.
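A minimal sketch of the variation step, assuming the recorded traffic is a JSON list of `{"prompt": ..., "completion": ...}` objects (a hypothetical schema; the actual structure of the recorded file may differ). It truncates each recorded prompt at random points to mimic a user typing in VSCode, and writes the result to a file that could then be fed to perf_analyzer via `--input-data` (exact flags and input-data layout depend on the model's config):

```python
import json
import random


def load_samples(path):
    """Load recorded prompt/completion pairs.

    Assumed (hypothetical) schema: a JSON list of
    {"prompt": ..., "completion": ...} objects.
    """
    with open(path) as f:
        return json.load(f)


def make_variations(samples, n_per_sample=3, seed=0):
    """Generate load variations by truncating each recorded prompt
    at random cut points, simulating incremental typing."""
    rng = random.Random(seed)
    variations = []
    for sample in samples:
        prompt = sample["prompt"]
        for _ in range(n_per_sample):
            # Cut somewhere in [1, len(prompt)] so every variation
            # is a non-empty prefix of the recorded prompt.
            cut = rng.randint(1, max(1, len(prompt)))
            variations.append({"prompt": prompt[:cut]})
    return variations


if __name__ == "__main__":
    samples = load_samples("code-suggestion-prompt-and-completions.json")
    with open("perf-input.json", "w") as f:
        json.dump(make_variations(samples), f, indent=2)
    # Then point perf_analyzer at the staging Triton, e.g. (flags are
    # illustrative; host, port, and model name are placeholders):
    #   perf_analyzer -m <model> -u <triton-host>:8000 \
    #       --input-data perf-input.json --concurrency-range 1:8
```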

We probably want to do this on a staging deployment, so let me create a separate issue for that.

Sample prompts and completions

code-suggestion-prompt-and-completions.json

Edited by Alper Akgun