Expose configured concurrency limits for models
This allows us to configure limits for a given engine and model by
setting the environment variable `MODEL_ENGINE_CONCURRENCY_LIMITS`
to a JSON value like the following:

```json
{
  "vertex-ai": { "code-gecko@002": 7 },
  "anthropic": { "claude-2.0": 2, "claude-2.1": 7 }
}
```
With this set, we expose these limits in a Prometheus gauge
called `model_inferences_max_concurrent` as soon as we've run an
inference for that model. When no value is set for a model, the gauge
is not exposed.
This allows us to combine the new gauge with the existing `model_inferences_in_flight`
gauge to see how much of the configured capacity we're using at the time of sampling.
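The intended behavior can be sketched as follows. This is a simplified stand-in that uses plain dicts instead of a real Prometheus registry, and the function names (`record_inference_start`, `record_inference_end`) are hypothetical; it only illustrates that the limit gauge appears after the first inference and only for models with a configured limit:

```python
# Simplified stand-ins for the two gauges; keys are (engine, model) label pairs.
in_flight: dict = {}
max_concurrent: dict = {}

# Example limits, as parsed from MODEL_ENGINE_CONCURRENCY_LIMITS.
LIMITS = {"vertex-ai": {"code-gecko@002": 3}}

def record_inference_start(engine: str, model: str) -> None:
    key = (engine, model)
    in_flight[key] = in_flight.get(key, 0) + 1
    limit = LIMITS.get(engine, {}).get(model)
    if limit is not None:
        # Only expose the max_concurrent gauge when a limit is configured.
        max_concurrent[key] = limit

def record_inference_end(engine: str, model: str) -> None:
    in_flight[(engine, model)] -= 1

record_inference_start("vertex-ai", "code-gecko@002")
record_inference_end("vertex-ai", "code-gecko@002")
print(in_flight[("vertex-ai", "code-gecko@002")])       # 0
print(max_concurrent[("vertex-ai", "code-gecko@002")])  # 3
```

Dividing `model_inferences_in_flight` by `model_inferences_max_concurrent` at query time then gives the utilization at the moment of sampling.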
For gitlab-com/runbooks#143
Tried this out by starting the application as follows:

```shell
MODEL_ENGINE_CONCURRENCY_LIMITS='{ "vertex-ai": { "code-gecko@002": 3 }}' poetry run ai_gateway
```
And issuing a request for a suggestion:

```shell
curl --request POST --url http://localhost:5052/v2/code/completions \
  --header 'Content-Type: application/json' \
  --header 'X-Gitlab-Authentication-Type: oidc' \
  --data '{
    "prompt_version": 1,
    "project_path": "awesome_project",
    "project_id": 23,
    "current_file": {
      "file_name": "main.ts",
      "content_above_cursor": "\nfunction calculator(a: Number, b: Number, operation: string): number {\n  // operation can be +, -, * or /\n  \n",
      "content_below_cursor": ""
    }
  }'
```
This request fails for me because I don't have the proper auth set up for Vertex, but the metrics are created nonetheless:
```shell
curl -s http://localhost:8082 | grep 'model_inference'
```

```
# HELP model_inferences_in_flight The number of in flight inferences running
# TYPE model_inferences_in_flight gauge
model_inferences_in_flight{model_engine="vertex-ai",model_name="code-gecko@002"} 0.0
# HELP model_inferences_max_concurrent The maximum number of inferences we can run concurrently on a model
# TYPE model_inferences_max_concurrent gauge
model_inferences_max_concurrent{model_engine="vertex-ai",model_name="code-gecko@002"} 3.0
```