2023-05-24: Code Suggestions Latency
Customer Impact
Current Status
We are seeing increased latency and unavailability on the code suggestions service. This corresponds to increased usage over the last 4 hours.
The biggest impact was seen for approximately 2.5 hours, from 08:00 to 10:30 UTC.
The root cause was increased load that required adding more nodes to the Triton pool. After provisioning the new nodes, we noticed they were not receiving traffic. We took some actions following this; the underlying problem was that, due to long keep-alives configured on the model gateway, no new connections to the new Triton servers were established. We resolved this at first by restarting most of the model-gateway pods, then by configuring keep-alives and gRPC timeouts in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!100 (merged).
https://log.gprd.gitlab.net/goto/20319580-fa0e-11ed-a017-0d32180b1390
This is causing increased latency to the service.
https://log.gprd.gitlab.net/goto/bc7b3d00-fa0f-11ed-8afc-c9851e4645c0
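The keep-alive behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the change made in ai-assist!100: the `Client` class, the `simulate` function, the backend names, and the request-count-based connection lifetime are all hypothetical stand-ins for time-based keep-alive settings. It shows that clients holding a connection indefinitely never reach backends added later, while clients that reconnect periodically do.

```python
import itertools
from collections import Counter


class Client:
    """Hypothetical gateway pod holding one long-lived backend connection."""

    def __init__(self, max_requests_per_connection=None):
        # None models an effectively unbounded keep-alive (assumption).
        self.max_requests_per_connection = max_requests_per_connection
        self.backend = None
        self.requests_on_connection = 0

    def send(self, pick_backend):
        expired = (
            self.max_requests_per_connection is not None
            and self.requests_on_connection >= self.max_requests_per_connection
        )
        if self.backend is None or expired:
            # Only a (re)connect consults the current list of backends.
            self.backend = pick_backend()
            self.requests_on_connection = 0
        self.requests_on_connection += 1
        return self.backend


def simulate(max_requests_per_connection, clients=8, rounds=100):
    """Return requests served per backend after the pool is scaled up."""
    backends = ["triton-0", "triton-1"]  # illustrative names
    counter = itertools.count()

    def pick_backend():
        # Round-robin over whichever backends exist at connect time.
        return backends[next(counter) % len(backends)]

    pool = [Client(max_requests_per_connection) for _ in range(clients)]

    # Phase 1: traffic against the original two backends.
    for _ in range(rounds):
        for client in pool:
            client.send(pick_backend)

    # Scale up: two new backends join the pool.
    backends.extend(["triton-2", "triton-3"])

    # Phase 2: count which backends actually serve requests.
    served = Counter()
    for _ in range(rounds):
        for client in pool:
            served[client.send(pick_backend)] += 1
    return served


if __name__ == "__main__":
    print("unbounded keep-alive:", dict(simulate(None)))
    print("bounded connection lifetime:", dict(simulate(10)))
```

In the unbounded case, restarting the clients (as was done with the model-gateway pods) is the only way for the new backends to receive traffic, which matches the sequence of events in this incident.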
Corrective actions
- dynamic batching to make better use of the triton nodes
- add a liveness probe to triton to enable us to auto-scale
- add more visibility into GPU usage
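As a rough illustration of the first corrective action, the idea behind dynamic batching is to group individual requests into one inference call, bounded by a maximum batch size and a maximum queueing delay. The sketch below is a simplified model of that idea, not Triton's scheduler; `form_batches` and its parameters are hypothetical names, and arrival times are plain numbers in arbitrary time units.

```python
def form_batches(arrivals, max_batch_size, max_queue_delay):
    """Group request arrival times into batches.

    A batch is dispatched when it reaches max_batch_size, or when the next
    request arrives more than max_queue_delay after the oldest queued one.
    """
    batches = []
    current = []
    for t in sorted(arrivals):
        # Flush the queue if its oldest request has waited too long.
        if current and t - current[0] > max_queue_delay:
            batches.append(current)
            current = []
        current.append(t)
        # Flush as soon as the batch is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3, 50, 51]
    print(form_batches(arrivals, max_batch_size=4, max_queue_delay=5))
    print(form_batches(arrivals, max_batch_size=1, max_queue_delay=5))
```

With a batch size of 1, every request becomes its own inference call; allowing larger batches reduces the number of calls for the same traffic, at the cost of a bounded extra wait per request.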
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.

