Root Cause Analysis: Latency for Code Suggestion
Overview
With a 40%+ increase in traffic on Code Suggestion on May 18th, we observed response latency rise above 10 s and declared an incident. Details of the incident are here: gitlab-com/gl-infra/production#14455 (closed)
Why
Based on the investigation, this was an anticipated concern: whether the Model Gateway, the client side, or the inference server (Triton) would be able to handle the load. The investigation found that the Model Gateway and the client side had healthy performance, but the CPU on the Triton inference side was impacted.
What went well
Even though this is an area where both AI Assisted and Infra are still learning, we acted quickly on the following:
- We increased the model gateway to 12 pods. Previously we had 1 pod per deployment.
- We restarted the Triton server.
- We increased the Triton server replicas from 1 to 4 to manage the load.
- With each replica we added an NVIDIA A100 40 GB to the node pool. With this load-balancing architecture, the system is currently able to manage latency.
Corrective Action
On May 22nd we plan to increase adoption through a banner. We will monitor the latency impact on an ongoing basis and will also start working on the corrective measures listed below. The full work is estimated to take two weeks, through the end of May 31st.
- We continue to work on the corrective actions from the previous incident gitlab-com/gl-infra/production#14451 (closed). DRI: AI Assisted.
- We look into further Triton optimisation through Model Analyzer, or GPU optimisation for inference (ML Infra : Triton Optimization through Model An... (#104 - closed)). DRI: AI Assisted. 31st May.
- We continue to monitor CPU and GPU usage and opt for load balancing (GPU/TPU/CPU Load Balancing (#105 - closed)). DRI: AI Assisted and Infra. 26th May.
- We write infrastructure as code, including Terraform scripts and kubectl commands that automate debugging (Infrastructure as Code for GCP Instance (#108 - moved)). DRI: TBD.
- Add more instrumentation and timing information to model gateway logs (time spent in GitLab auth, time spent in Triton) (Refinement of Instrumentation of Model Inferenc... (#107 - closed)). DRI: Infra.
- Make the model-gateway deployment span different zones, like model-triton (Configure model-gateway so it is deployed acros... (#109 - closed)). DRI: Infra.
- Add request/rate limit to gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#103. DRI: Infra.
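The instrumentation item above (time spent in GitLab auth versus time spent in Triton) could be approached roughly as follows. This is a minimal sketch, not the model gateway's actual code: the logger name, the `RequestTimings` class, the `handle_request` function, and the `authenticate` and `infer` callables are all hypothetical stand-ins for the real auth check and Triton call.

```python
import json
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; the real gateway's logging setup may differ.
logger = logging.getLogger("model_gateway.timing")


class RequestTimings:
    """Collects per-stage wall-clock durations for a single request."""

    def __init__(self):
        self.durations_s = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failed calls are timed too.
            # Accumulate so a stage entered twice is summed, not overwritten.
            elapsed = time.perf_counter() - start
            self.durations_s[name] = self.durations_s.get(name, 0.0) + elapsed

    def as_log_fields(self):
        fields = {
            f"{name}_duration_s": round(value, 6)
            for name, value in self.durations_s.items()
        }
        # Sum of the timed stages only, not the full request time.
        fields["stages_total_duration_s"] = round(sum(self.durations_s.values()), 6)
        return fields


def handle_request(prompt, authenticate, infer):
    """Hypothetical handler: `authenticate` stands in for the GitLab auth
    check and `infer` for the call to the Triton inference server."""
    timings = RequestTimings()
    with timings.stage("gitlab_auth"):
        authenticate()
    with timings.stage("triton"):
        completion = infer(prompt)
    # One structured log line per request, easy to aggregate by field.
    logger.info(json.dumps(timings.as_log_fields()))
    return completion, timings
```

With one structured line per request, a latency spike can be attributed to a stage by comparing `gitlab_auth_duration_s` and `triton_duration_s` rather than inferring it from pod-level metrics.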
Further
Since we will be increasing adoption, it helps that we now have a better understanding of the infrastructure for troubleshooting, and this is documented as well: gitlab-com/runbooks!5815 (merged)