Split model inference and post processing timing on code completions
## What does this merge request do and why?
Before this change, `inference_duration_s` in the logs and the `code_suggestions_inference_duration_seconds` Prometheus metric included both model inference and post-processing time, which makes it difficult to troubleshoot latency bottlenecks.

This MR moves the post-processing step into a use case class and introduces a generic benchmark wrapper to make it easier to track durations in the future.
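For context, a benchmark wrapper of this kind can be as small as a context manager that times a block, records the duration in a Prometheus histogram, and exposes the value for structured logging. The sketch below is illustrative only: `benchmark`, `POST_PROCESSING_DURATION`, and the usage wiring are assumed names, not the exact implementation in this MR (the metric name and labels match the ones verified further down).

```python
# Illustrative sketch only -- `benchmark` and POST_PROCESSING_DURATION are
# hypothetical names; the metric name/labels mirror the ones this MR exposes.
import time
from contextlib import contextmanager

from prometheus_client import Histogram

POST_PROCESSING_DURATION = Histogram(
    "code_suggestions_post_processing_duration_seconds",
    "Duration of post processing in seconds",
    labelnames=["model_engine", "model_name"],
)


@contextmanager
def benchmark(histogram: Histogram, **labels):
    """Time the wrapped block, record it in `histogram`, and expose the value."""
    timing = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        timing["duration_s"] = time.perf_counter() - start
        histogram.labels(**labels).observe(timing["duration_s"])


# Hypothetical usage inside the completions use case:
# with benchmark(POST_PROCESSING_DURATION,
#                model_engine="vertex-ai",
#                model_name="code-gecko@latest") as timing:
#     suggestion = post_process(raw_completion)
# log_fields["post_processing_duration_s"] = timing["duration_s"]
```

Because the wrapper is generic, the same pattern can later time other stages (prompt building, tree-sitter parsing, and so on) without duplicating timing code.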
## How to set up and validate locally
- Check out this merge request's branch.
- Ensure the local Docker image builds successfully.

  ```shell
  docker buildx build --platform linux/amd64 \
    -t code-suggestions-api:dev .
  ```
- Run the service locally in Docker with an additional metrics port mapping.

  ```shell
  docker run --platform linux/amd64 --rm \
    -p 8082:8082 \
    -p 5052:5052 \
    -e AUTH_BYPASS_EXTERNAL=true \
    -v $PWD:/app -it code-suggestions-api:dev
  ```
- Make a cURL request to the `v2/code/completions` endpoint.

  ```shell
  curl --request POST \
    --url http://codesuggestions.gdk.test:5052/v2/code/completions \
    --header 'Content-Type: application/json' \
    --header 'X-Gitlab-Authentication-Type: oidc' \
    --data '{
      "prompt_version": 1,
      "project_path": "awesome_project",
      "project_id": 23,
      "current_file": {
        "file_name": "main.ts",
        "content_above_cursor": "\nfunction calculator(a: Number, b: Number, operation: string): number {\n // operation can be +, -, * or /\n \n",
        "content_below_cursor": ""
      }
    }'
  ```
- Verify `post_processing_duration_s` in the log (formatted here for legibility).

  ```json
  {
    "url": "http://codesuggestions.gdk.test:5052/v2/code/completions",
    "path": "/v2/code/completions",
    "status_code": 200,
    "method": "POST",
    "correlation_id": "050aa158234542938cf61d87eb2f2fba",
    "http_version": "1.1",
    "client_ip": "172.17.0.1",
    "client_port": 60590,
    "duration_s": 1.7459166619955795,
    "cpu_s": 0.3463242130000026,
    "user_agent": null,
    "gitlab_instance_id": null,
    "gitlab_global_user_id": null,
    "gitlab_host_name": null,
    "gitlab_saas_namespace_ids": null,
    "gitlab_realm": null,
    "model_engine": "vertex-ai",
    "model_name": "code-gecko@latest",
    "prompt_length": 182,
    "prompt_length_stripped": 142,
    "inference_duration_s": 1.6582754560004105,
    "suffix_length": 0,
    "prompt_symbols": {
      "comment": 2
    },
    "experiments": [
      {
        "name": "exp_truncate_suffix",
        "variant": 0
      }
    ],
    "lang": "ts",
    "editor_lang": null,
    "model_output_length": 144,
    "model_output_length_stripped": 90,
    "model_output_score": -3.972966194152832,
    "blocked": false,
    "post_processing_duration_s": 0.009877080003207084,
    "logger": "api.access",
    "level": "info",
    "type": "mlops",
    "stage": "main",
    "timestamp": "2023-12-05T05:41:37.234534Z",
    "message": "172.17.0.1:60590 - \"POST /v2/code/completions HTTP/1.1\" 200"
  }
  ```
- Verify metrics are captured (a small optional verification script is sketched after this list).

  ```shell
  ❯ curl -s http://localhost:8999/metrics | rg "(code_suggestions_post_processing_duration_seconds)"
  # HELP code_suggestions_post_processing_duration_seconds Duration of post processing in seconds
  # TYPE code_suggestions_post_processing_duration_seconds histogram
  code_suggestions_post_processing_duration_seconds_bucket{le="0.005",model_engine="vertex-ai",model_name="code-gecko@latest"} 0.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.01",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.025",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.05",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.075",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.1",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.25",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="0.75",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="1.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="2.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="5.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="7.5",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="10.0",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_bucket{le="+Inf",model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_count{model_engine="vertex-ai",model_name="code-gecko@latest"} 1.0
  code_suggestions_post_processing_duration_seconds_sum{model_engine="vertex-ai",model_name="code-gecko@latest"} 0.009877080003207084
  # HELP code_suggestions_post_processing_duration_seconds_created Duration of post processing in seconds
  # TYPE code_suggestions_post_processing_duration_seconds_created gauge
  code_suggestions_post_processing_duration_seconds_created{model_engine="vertex-ai",model_name="code-gecko@latest"} 1.7017548972217996e+09
  ```
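If you would rather verify the exposition output programmatically than eyeball the `rg` output, a small script along these lines works. It is not part of this MR: it assumes `requests` and `prometheus_client` are installed locally and that the metrics endpoint is the same one used above.

```python
# Hypothetical helper: parse the /metrics output and confirm the new
# post-processing histogram is present and has recorded at least one sample.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRIC = "code_suggestions_post_processing_duration_seconds"


def has_post_processing_histogram(metrics_url: str = "http://localhost:8999/metrics") -> bool:
    text = requests.get(metrics_url, timeout=5).text
    for family in text_string_to_metric_families(text):
        if family.name == METRIC:
            # A histogram family exposes _bucket/_count/_sum samples.
            return any(s.name.endswith("_count") and s.value > 0 for s in family.samples)
    return False


if __name__ == "__main__":
    print("post-processing histogram populated:", has_post_processing_histogram())
```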
## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Closes #360