
Split model inference and post processing timing on code completions

Tan Le requested to merge 360-split-inference-post-processing-timing into main

What does this merge request do and why?

Before this change, the inference_duration_s log field and the code_suggestions_inference_duration_seconds Prometheus metric included both model inference and post-processing time. This made it difficult to troubleshoot latency bottlenecks.

This MR moves the post-processing step into the use case class and introduces a generic benchmark wrapper to make tracking durations easier in the future.
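A generic benchmark wrapper of this kind can be sketched as a small context manager that times the wrapped block and hands the duration to a callback. This is a minimal illustration only; the name benchmark and the callback signature are assumptions, not the MR's actual API:

```python
import time
from contextlib import contextmanager


@contextmanager
def benchmark(on_duration):
    """Time the wrapped block and report the elapsed seconds to a callback."""
    start = time.perf_counter()
    try:
        yield
    finally:
        on_duration(time.perf_counter() - start)


# Example: capture the post-processing duration as a log field.
fields = {}
with benchmark(lambda d: fields.update(post_processing_duration_s=d)):
    time.sleep(0.01)  # stand-in for the post-processing step

print(fields["post_processing_duration_s"])
```

The callback keeps the wrapper generic: the same helper can feed a structured-log field, a Prometheus histogram's observe(), or both.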

How to set up and validate locally

  1. Check out this merge request's branch.

  2. Ensure a local Docker image builds successfully.

    docker buildx build --platform linux/amd64 \
      -t code-suggestions-api:dev .
  3. Run the service locally in Docker with an additional metrics port mapping.

    docker run --platform linux/amd64 --rm \
      -p 8082:8082 \
      -p 5052:5052 \
      -e AUTH_BYPASS_EXTERNAL=true \
      -v $PWD:/app -it code-suggestions-api:dev
  4. Make a cURL request to the v2/code/completions endpoint.

    curl --request POST \
      --url http://codesuggestions.gdk.test:5052/v2/code/completions \
      --header 'Content-Type: application/json' \
      --header 'X-Gitlab-Authentication-Type: oidc' \
      --data '{
      "prompt_version": 1,
      "project_path": "awesome_project",
      "project_id": 23,
      "current_file": {
        "file_name": "main.ts",
        "content_above_cursor": "\nfunction calculator(a: Number, b: Number, operation: string): number {\n    // operation can be +, -, * or /\n    \n",
        "content_below_cursor": ""
      }
    }'
  5. Verify that post_processing_duration_s appears in the log (formatted here for legibility).

    {
      "url": "http://codesuggestions.gdk.test:5052/v2/code/completions",
      "path": "/v2/code/completions",
      "status_code": 200,
      "method": "POST",
      "correlation_id": "050aa158234542938cf61d87eb2f2fba",
      "http_version": "1.1",
      "client_ip": "172.17.0.1",
      "client_port": 60590,
      "duration_s": 1.7459166619955795,
      "cpu_s": 0.3463242130000026,
      "user_agent": null,
      "gitlab_instance_id": null,
      "gitlab_global_user_id": null,
      "gitlab_host_name": null,
      "gitlab_saas_namespace_ids": null,
      "gitlab_realm": null,
      "model_engine": "vertex-ai",
      "model_name": "code-gecko@latest",
      "prompt_length": 182,
      "prompt_length_stripped": 142,
      "inference_duration_s": 1.6582754560004105,
      "suffix_length": 0,
      "prompt_symbols": {
        "comment": 2
      },
      "experiments": [
        {
          "name": "exp_truncate_suffix",
          "variant": 0
        }
      ],
      "lang": "ts",
      "editor_lang": null,
      "model_output_length": 144,
      "model_output_length_stripped": 90,
      "model_output_score": -3.972966194152832,
      "blocked": false,
      "post_processing_duration_s": 0.009877080003207084,
      "logger": "api.access",
      "level": "info",
      "type": "mlops",
      "stage": "main",
      "timestamp": "2023-12-05T05:41:37.234534Z",
      "message": "172.17.0.1:60590 - \"POST /v2/code/completions HTTP/1.1\" 200"
    }
  6. Verify metrics are captured.

    ❯ curl -s http://localhost:8999/metrics | rg "(code_suggestions_post_processing_duration_seconds)"
    # HELP code_suggestions_post_processing_duration_seconds Duration of post processing in seconds
    # TYPE code_suggestions_post_processing_duration_seconds histogram
    code_suggestions_post_processing_duration_seconds_bucket{le="0.005",model_engine="vertex-ai",model_name="code-gecko@latest"} 0.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.01",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.025",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.05",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.075",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.1",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.25",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="0.75",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="1.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="2.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="5.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="7.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="10.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_bucket{le="+Inf",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_count{model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
    code_suggestions_post_processing_duration_seconds_sum{model_engine="vertex-ai",model_name="code-gecko@latest"} 0.009877080003207084
    # HELP code_suggestions_post_processing_duration_seconds_created Duration of post processing in seconds
    # TYPE code_suggestions_post_processing_duration_seconds_created gauge
    code_suggestions_post_processing_duration_seconds_created{model_engine="vertex-ai",model_name="code-gecko@latest"} 1.7017548972217996e+09
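As a quick sanity check on the split, the two phase durations in the log above should now be disjoint slices of the overall request time, so their sum must not exceed duration_s. A small sketch using the values from the log output in step 5:

```python
import json

# Values copied from the log entry shown in step 5.
log_line = (
    '{"duration_s": 1.7459166619955795,'
    ' "inference_duration_s": 1.6582754560004105,'
    ' "post_processing_duration_s": 0.009877080003207084}'
)
entry = json.loads(log_line)

# With the timing split in place, inference and post-processing are
# measured separately, so together they stay within the total duration.
phases = entry["inference_duration_s"] + entry["post_processing_duration_s"]
assert phases <= entry["duration_s"]
print(f"unaccounted overhead: {entry['duration_s'] - phases:.4f}s")
```

The remainder is framework and network overhead outside both timed phases.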

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.

Closes #360 (closed)
