ML Infra: Scalability: Analysing inference load on Triton with perf_analyzer
To test the Model Gateway and the Triton Server, we need some simple scripts that generate load.
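For the Triton side, a first pass could just drive the model directly with perf_analyzer from the Triton SDK container. This is only a sketch: the model name, endpoint, input-data file, and concurrency range below are placeholders, not our actual deployment values.

```bash
# Run from the Triton SDK container (nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk).
# Model name, URL, and input-data file are illustrative placeholders.
perf_analyzer \
  -m code_completion_model \
  -u triton-staging:8001 -i grpc \
  --input-data sample_prompts.json \
  --concurrency-range 1:16:4 \
  --measurement-interval 10000 \
  -f results.csv
```

Note that perf_analyzer only exercises Triton itself; the Model Gateway sits in front of it, so it would still need a separate script hitting its HTTP API.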
What would be a representative load sample? I was inclined to record some traffic submitted via VSCode, and then generate variations from there.
We probably want to do this on a staging deployment, so let me create a separate issue for that.
Sample prompts and completions
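A minimal sketch of how recorded samples could be fed to perf_analyzer as an --input-data file. The input tensor name ("prompt") and the prompt strings are made up for illustration; the real names come from the model config.

```json
{
  "data": [
    { "prompt": { "content": ["def fibonacci(n):"], "shape": [1] } },
    { "prompt": { "content": ["# sort a list of users by last name\n"], "shape": [1] } },
    { "prompt": { "content": ["import numpy as np\n\ndef softmax(x):"], "shape": [1] } }
  ]
}
```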