feat(saturation): add max_concurrent_inferences
This adds the max_concurrent_inferences saturation points, which measure the number of requests currently in flight to an LLM relative to the imposed limits.
At the time of writing, only Anthropic enforces limits this way, and we are already in the process of requesting an increase.
The metrics are emitted from the application. The limits are configured in a Vault secret that Runway loads into the environment. Documentation for this will be added in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#391 (closed)
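
As a rough illustration of the idea (not the actual application code), the sketch below counts in-flight requests and divides by a limit read from the environment. The class name `InFlightTracker`, the helper `limit_from_env`, the environment variable `HYPOTHETICAL_MAX_CONCURRENT_INFERENCES`, and the default of 10 are all assumptions made for this example; the real metric names and configuration keys may differ.

```python
import os
import threading
from contextlib import contextmanager


class InFlightTracker:
    """Counts requests currently in flight and compares them to a limit."""

    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def track(self):
        # Increment on entry and decrement on exit, even if the request fails.
        with self._lock:
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1

    @property
    def in_flight(self) -> int:
        with self._lock:
            return self._in_flight

    def saturation(self) -> float:
        # Ratio of current usage to the imposed limit; 1.0 means fully saturated.
        return self.in_flight / self.limit


def limit_from_env(
    name: str = "HYPOTHETICAL_MAX_CONCURRENT_INFERENCES",  # hypothetical variable name
    default: int = 10,  # hypothetical fallback, not a real provider limit
) -> int:
    # The limit is assumed to arrive as a plain integer in the environment.
    return int(os.environ.get(name, default))
```

A caller would wrap each LLM request in `with tracker.track():` and periodically export `tracker.in_flight` and `tracker.limit` as gauges, so the saturation ratio can be computed on the monitoring side.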
For #143 (closed)