Split model inference and post processing timing on code completions
## What does this merge request do and why?
Before this change, `inference_duration_s` in the logs and the `code_suggestions_inference_duration_seconds` Prometheus metric included both model inference and post-processing time, which makes it difficult to troubleshoot latency bottlenecks.

This MR moves the post-processing step into a use case class and introduces a generic benchmark wrapper to make it easier to track durations in the future.
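For context, a benchmark wrapper of this kind can be as small as a context manager that times a block, records the duration in a Prometheus histogram, and exposes the value for structured logging. The sketch below is illustrative only: `benchmark`, `POST_PROCESSING_DURATION`, and the usage wiring are assumed names, not the exact implementation in this MR (the metric name and labels match the ones verified further down).

```python
# Illustrative sketch only -- `benchmark` and POST_PROCESSING_DURATION are
# hypothetical names; the metric name/labels mirror the ones this MR exposes.
import time
from contextlib import contextmanager

from prometheus_client import Histogram

POST_PROCESSING_DURATION = Histogram(
    "code_suggestions_post_processing_duration_seconds",
    "Duration of post processing in seconds",
    labelnames=["model_engine", "model_name"],
)


@contextmanager
def benchmark(histogram: Histogram, **labels):
    """Time the wrapped block, record it in `histogram`, and expose the value."""
    timing = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        timing["duration_s"] = time.perf_counter() - start
        histogram.labels(**labels).observe(timing["duration_s"])


# Hypothetical usage inside the completions use case:
# with benchmark(POST_PROCESSING_DURATION,
#                model_engine="vertex-ai",
#                model_name="code-gecko@latest") as timing:
#     suggestion = post_process(raw_completion)
# log_fields["post_processing_duration_s"] = timing["duration_s"]
```

Because the wrapper is generic, the same pattern can later time other stages (prompt building, tree-sitter parsing, and so on) without duplicating timing code.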
## How to set up and validate locally
- Check out this merge request's branch.
- Ensure the local Docker image builds successfully.

  ```shell
  docker buildx build --platform linux/amd64 \
    -t code-suggestions-api:dev .
  ```
- Run the service locally in Docker with an additional metrics port mapping.

  ```shell
  docker run --platform linux/amd64 --rm \
    -p 8082:8082 \
    -p 5052:5052 \
    -e AUTH_BYPASS_EXTERNAL=true \
    -v $PWD:/app -it code-suggestions-api:dev
  ```
- Make a cURL request to the `v2/code/completions` endpoint.

  ```shell
  curl --request POST \
    --url http://codesuggestions.gdk.test:5052/v2/code/completions \
    --header 'Content-Type: application/json' \
    --header 'X-Gitlab-Authentication-Type: oidc' \
    --data '{
      "prompt_version": 1,
      "project_path": "awesome_project",
      "project_id": 23,
      "current_file": {
        "file_name": "main.ts",
        "content_above_cursor": "\nfunction calculator(a: Number, b: Number, operation: string): number {\n // operation can be +, -, * or /\n \n",
        "content_below_cursor": ""
      }
    }'
  ```
- Verify `post_processing_duration_s` in the log (formatted here for legibility).

  ```json
  {
    "url": "http://codesuggestions.gdk.test:5052/v2/code/completions",
    "path": "/v2/code/completions",
    "status_code": 200,
    "method": "POST",
    "correlation_id": "050aa158234542938cf61d87eb2f2fba",
    "http_version": "1.1",
    "client_ip": "172.17.0.1",
    "client_port": 60590,
    "duration_s": 1.7459166619955795,
    "cpu_s": 0.3463242130000026,
    "user_agent": null,
    "gitlab_instance_id": null,
    "gitlab_global_user_id": null,
    "gitlab_host_name": null,
    "gitlab_saas_namespace_ids": null,
    "gitlab_realm": null,
    "model_engine": "vertex-ai",
    "model_name": "code-gecko@latest",
    "prompt_length": 182,
    "prompt_length_stripped": 142,
    "inference_duration_s": 1.6582754560004105,
    "suffix_length": 0,
    "prompt_symbols": {
      "comment": 2
    },
    "experiments": [
      {
        "name": "exp_truncate_suffix",
        "variant": 0
      }
    ],
    "lang": "ts",
    "editor_lang": null,
    "model_output_length": 144,
    "model_output_length_stripped": 90,
    "model_output_score": -3.972966194152832,
    "blocked": false,
    "post_processing_duration_s": 0.009877080003207084,
    "logger": "api.access",
    "level": "info",
    "type": "mlops",
    "stage": "main",
    "timestamp": "2023-12-05T05:41:37.234534Z",
    "message": "172.17.0.1:60590 - \"POST /v2/code/completions HTTP/1.1\" 200"
  }
  ```
- Verify metrics are captured (a small optional verification script is sketched after this list).

  ```shell
  ❯ curl -s http://localhost:8999/metrics | rg "(code_suggestions_post_processing_duration_seconds)"
  # HELP code_suggestions_post_processing_duration_seconds Duration of post processing in seconds
  # TYPE code_suggestions_post_processing_duration_seconds histogram
  code_suggestions_post_processing_duration_seconds_bucket{le="0.005",model_engine="vertex-ai",model_name="code-gecko@latest"} 0.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.01",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.025",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.05",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.075",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.1",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.25",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.75",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="1.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="2.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="5.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="7.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="10.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="+Inf",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_count{model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_sum{model_engine="vertex-ai",model_name="code-gecko@latest"} 0.009877080003207084
  # HELP code_suggestions_post_processing_duration_seconds_created Duration of post processing in seconds
  # TYPE code_suggestions_post_processing_duration_seconds_created gauge
  code_suggestions_post_processing_duration_seconds_created{model_engine="vertex-ai",model_name="code-gecko@latest"} 1.7017548972217996e+09
  ```
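If you would rather verify the exposition output programmatically than eyeball the `rg` output, a small script along these lines works. It is not part of this MR: it assumes `requests` and `prometheus_client` are installed locally and that the metrics endpoint is the same one used above.

```python
# Hypothetical helper: parse the /metrics output and confirm the new
# post-processing histogram is present and has recorded at least one sample.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRIC = "code_suggestions_post_processing_duration_seconds"


def has_post_processing_histogram(metrics_url: str = "http://localhost:8999/metrics") -> bool:
    text = requests.get(metrics_url, timeout=5).text
    for family in text_string_to_metric_families(text):
        if family.name == METRIC:
            # A histogram family exposes _bucket/_count/_sum samples.
            return any(s.name.endswith("_count") and s.value > 0 for s in family.samples)
    return False


if __name__ == "__main__":
    print("post-processing histogram populated:", has_post_processing_histogram())
```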
## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Closes #360