Root Cause Analysis: Latency for Code Suggestion

Overview

With a 40%+ increase in traffic on Code Suggestion on May 18th, we observed response latency rise above 10 s and declared an incident. Details of the incident are here: gitlab-com/gl-infra/production#14455 (closed)

Why

Before the incident, there was already an anticipated concern about whether the Model Gateway, the client side, or the Inference Server (Triton) would be able to handle the load. Through the investigation we found that the Model Gateway and the client side had healthy performance, but the CPU load on the Triton inference server was impacted.

What went well

Although this is an area where both AI Assisted and Infra are still learning, we acted quickly and took the following steps:

1. We increased the model gateway to 12 pods. Previously we had 1 pod per deployment.

2. We restarted the Triton server.

3. We increased the Triton server replicas from 1 to 4 to manage the load.
With each replica we added an NVIDIA A100 40 GB to the node pool. With this load-balanced architecture, the system is currently able to manage latency.

Corrective Action

On May 22nd we plan to increase adoption through a banner. We will monitor the latency impact on an ongoing basis and will also start working on the corrective measures stated below. The full work is estimated to take two weeks, through May 31st.

We continue to work on the corrective actions needed from the previous incident gitlab-com/gl-infra/production#14451 (closed). DRI: AI Assisted
We look into further Triton optimisation through model analyzer, or GPU optimisation through inference (ML Infra : Triton Optimization through Model An... (#104 - closed)). DRI: AI Assisted. 31st May
We continue to monitor CPU and GPU usage and opt for load balancing (GPU/TPU/CPU Load Balancing (#105 - closed)). DRI: AI Assisted and Infra. 26th May
We write infrastructure as code, including Terraform scripts and kubectl commands that automate debugging (Infrastructure as Code for GCP Instance (#108 - moved)). DRI: TBD
Add more instrumentation and timing information to model gateway logs (time spent in GitLab auth, time spent in Triton) (Refinement of Instrumentation of Model Inferenc... (#107 - closed)). DRI: Infra
Make the model-gateway deployment span different zones, like model-triton (Configure model-gateway so it is deployed acros... (#109 - closed)). DRI: Infra
Add a request/rate limit: gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#103. DRI: Infra
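
To illustrate the instrumentation item above (time spent in GitLab auth, time spent in Triton), here is a minimal sketch of per-stage timing that ends up in a single structured log line. It is not the model gateway's actual code: `RequestTimer`, `handle_request`, the stage names, and the log field names are all hypothetical and chosen only for this example.

```python
import json
import time
from contextlib import contextmanager


class RequestTimer:
    """Collects wall-clock time per named stage of one request (hypothetical helper)."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so a stage entered twice is summed, not overwritten.
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def as_log_line(self):
        # One JSON object per request, with one duration field per stage.
        fields = {f"{name}_duration_s": round(secs, 6)
                  for name, secs in self.stages.items()}
        fields["total_duration_s"] = round(sum(self.stages.values()), 6)
        return json.dumps(fields, sort_keys=True)


def handle_request(timer, authenticate, infer):
    # `authenticate` and `infer` are stand-ins for the real auth check
    # and the call to the inference server.
    with timer.stage("gitlab_auth"):
        authenticate()
    with timer.stage("triton"):
        result = infer()
    return result


if __name__ == "__main__":
    timer = RequestTimer()
    handle_request(timer, lambda: time.sleep(0.01), lambda: "completion")
    print(timer.as_log_line())
```

With timings split this way, a slow request can be attributed to the auth step or to inference directly from the logs, which is the distinction this investigation had to establish by hand.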

Further

Since we will be increasing adoption, it helps that we now have a better understanding of the infrastructure and are better able to troubleshoot it. This is documented as well: gitlab-com/runbooks!5815 (merged)

Edited May 21, 2023 by Wayne Haber