Expose configured concurrency limits for models
This allows us to configure limits for a given engine and model by
setting the environment variable `MODEL_ENGINE_CONCURRENCY_LIMITS`
to a JSON value like the following:

```json
{
  "vertex-ai": { "code-gecko@002": 7 },
  "anthropic": { "claude-2.0": 2, "claude-2.1": 7 }
}
```
With this set, we expose these limits in a Prometheus gauge
called `model_inferences_max_concurrent` as soon as we've run an
inference for that model. When no value is set for a model, the gauge
is not exposed.
This allows us to combine the new gauge with the existing `model_inferences_in_flight`
gauge to see how much of the configured capacity we're using at the time of sampling.
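The intended behavior can be sketched as follows. This is a simplified stand-in that uses plain dicts instead of a real Prometheus registry, and the function names (`record_inference_start`, `record_inference_end`) are hypothetical; it only illustrates that the limit gauge appears after the first inference and only for models with a configured limit:

```python
# Simplified stand-ins for the two gauges; keys are (engine, model) label pairs.
in_flight: dict = {}
max_concurrent: dict = {}

# Example limits, as parsed from MODEL_ENGINE_CONCURRENCY_LIMITS.
LIMITS = {"vertex-ai": {"code-gecko@002": 3}}

def record_inference_start(engine: str, model: str) -> None:
    key = (engine, model)
    in_flight[key] = in_flight.get(key, 0) + 1
    limit = LIMITS.get(engine, {}).get(model)
    if limit is not None:
        # Only expose the max_concurrent gauge when a limit is configured.
        max_concurrent[key] = limit

def record_inference_end(engine: str, model: str) -> None:
    in_flight[(engine, model)] -= 1

record_inference_start("vertex-ai", "code-gecko@002")
record_inference_end("vertex-ai", "code-gecko@002")
print(in_flight[("vertex-ai", "code-gecko@002")])       # 0
print(max_concurrent[("vertex-ai", "code-gecko@002")])  # 3
```

Dividing `model_inferences_in_flight` by `model_inferences_max_concurrent` at query time then gives the utilization at the moment of sampling.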
For gitlab-com/runbooks#143
Tried this out by starting the application as follows:

```shell
MODEL_ENGINE_CONCURRENCY_LIMITS='{ "vertex-ai": { "code-gecko@002": 3 }}' poetry run ai_gateway
```
And issuing a request for a suggestion:

```shell
curl --request POST --url http://localhost:5052/v2/code/completions \
  --header 'Content-Type: application/json' \
  --header 'X-Gitlab-Authentication-Type: oidc' \
  --data '{
    "prompt_version": 1,
    "project_path": "awesome_project",
    "project_id": 23,
    "current_file": {
      "file_name": "main.ts",
      "content_above_cursor": "\nfunction calculator(a: Number, b: Number, operation: string): number {\n  // operation can be +, -, * or /\n  \n",
      "content_below_cursor": ""
    }
  }'
```
This request fails for me because I don't have the proper auth set up for Vertex, but the metrics are created nonetheless:
```shell
curl -s http://localhost:8082 | grep 'model_inference'
```

```
# HELP model_inferences_in_flight The number of in flight inferences running
# TYPE model_inferences_in_flight gauge
model_inferences_in_flight{model_engine="vertex-ai",model_name="code-gecko@002"} 0.0
# HELP model_inferences_max_concurrent The maximum number of inferences we can run concurrently on a model
# TYPE model_inferences_max_concurrent gauge
model_inferences_max_concurrent{model_engine="vertex-ai",model_name="code-gecko@002"} 3.0
```