feat(saturation): add max_concurrent_inferences

Bob Van Landuyt requested to merge bvl/llm-concurrency-saturation into master

This adds the max_concurrent_inferences saturation point, which measures the number of requests currently in flight to an LLM against the concurrency limit the provider imposes.

At the time of writing only Anthropic is enforcing limits this way, and we're already in the process of requesting an increase.

The metrics are emitted from the application. The limits are configured in a vault secret that Runway loads into the environment. Documentation for this will be added in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#391 (closed).
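As a rough illustration of the idea (not the code in this merge request), the saturation ratio is the in-flight request count divided by the configured limit. The sketch below assumes a hypothetical `InferenceTracker` class and a hypothetical environment variable name, `MAX_CONCURRENT_INFERENCES_ANTHROPIC`; the real metric names and secret keys may differ.

```python
import os
import threading
from contextlib import contextmanager


class InferenceTracker:
    """Hypothetical helper: counts in-flight inference requests against a limit."""

    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def track(self):
        # Increment on entry and always decrement on exit, even if the
        # wrapped request raises.
        with self._lock:
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1

    @property
    def in_flight(self) -> int:
        with self._lock:
            return self._in_flight

    def saturation(self) -> float:
        # 0.0 means idle; 1.0 means the provider's limit has been reached.
        return self.in_flight / self.limit


def tracker_from_env(var: str = "MAX_CONCURRENT_INFERENCES_ANTHROPIC",
                     default: int = 100) -> InferenceTracker:
    # The variable name and default are assumptions for illustration only.
    return InferenceTracker(int(os.environ.get(var, default)))
```

Wrapping each outbound LLM call in `with tracker.track():` keeps the count accurate, and the application could export `in_flight` and `limit` as gauges so that the ratio is computed on the monitoring side.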

For #143 (closed)
