2020-12-08: SLI of the gitaly service (`cny` stage) has an error rate violating SLO
<!-- ISSUE TITLING: use the form "YYYY-MM-DD: briefly describe problem" -->

<!-- ISSUE LABELING: Don't forget to add labels for severity (S1 - S4) and service. If the incident relates to sensitive data or is security related, use the label ~security and mark the issue confidential. -->

## Summary

<!-- Leave a brief headline remark so that people know what's going on. It is perfectly acceptable for this to be vague while not much is known. -->

More information will be added as we investigate the issue.

## Timeline

<!-- Try to capture in this section, among other events:

- Time estimation for when the errors started - typically before the incident was declared.
- When the incident was declared.
- If other teams had to be engaged, when the right Subject Matter Expert (SME) - able to effectively work on the incident mitigation - was engaged.
- When the CMOC sent first comms for this incident.
- When the incident was mitigated.
- When the incident was fully resolved.
- A link to the original PagerDuty incident page, if any.
- ...

-->

Two spikes of error rates coming from the `cny` gitaly service:

![Screen_Shot_2020-12-08_at_10.18.20_AM](/uploads/751cddfffa3e95c8b433720a3ab760ca/Screen_Shot_2020-12-08_at_10.18.20_AM.png)

The first spike started around 16:04 and ended around 16:13; the second started around 16:59 and ended around 17:10. The errors affected `gitlab-org/gitlab`, and the failing gRPC calls were `Canceled`. CPU saturation was caused primarily by pack-objects and upload-pack.

All times UTC.

- 2020-12-08 16:33 - cindy declares incident in Slack.
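For context on what "error rate violating SLO" means in the title, the check can be sketched as a small calculation: the error-rate SLI is the fraction of failed requests over a measurement window, and it violates the SLO when it exceeds the allowed error rate. This is a minimal illustration only; the request counts and the 0.1% threshold below are hypothetical, not the actual gitaly `cny`-stage SLO.

```python
# Illustrative sketch of an error-rate SLI vs. SLO check.
# All numbers, including the 0.1% threshold, are hypothetical.

def error_rate_sli(failed: int, total: int) -> float:
    """Fraction of requests that failed over the measurement window."""
    return failed / total if total else 0.0

def violates_slo(failed: int, total: int, max_error_rate: float = 0.001) -> bool:
    """True when the observed error rate exceeds the allowed error rate."""
    return error_rate_sli(failed, total) > max_error_rate

# Hypothetical counts for one spike: 120 Canceled gRPC calls
# out of 50,000 requests is a 0.24% error rate, above a 0.1% SLO.
print(violates_slo(failed=120, total=50_000))
```

In practice these ratios come from the service's Prometheus metrics rather than raw counts, but the comparison against the SLO threshold is the same.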
## Corrective Actions

<!--
- _List issues that have been created as corrective actions from this incident._
- _For each issue, include the following:_
  - _<Bare Issue link> - Issue labeled as ~"corrective action"._
  - _Include an estimated date of completion of the corrective action._
  - _Include the named individual who owns the delivery of the corrective action._
- _If an incident review was completed, use Lessons Learned as a guideline for creation of Corrective Actions._
-->

----

<!-- THE BELOW IS TO BE CONDUCTED ONCE THE ABOVE INCIDENT IS MITIGATED. TRANSFER DATA FROM THE ABOVE INTO THE INCIDENT REVIEW SECTIONS BELOW. -->

<br/>

<details>
<summary><i>Click to expand or collapse the Incident Review section.</i>
<br/>

# Incident Review

</summary>

----

<!-- The purpose of this Incident Review is to serve as a classroom to help us better understand the root causes of an incident. Treating it as a classroom allows us to create the space to focus on devising the mechanisms needed to prevent a similar incident from recurring in the future.

A root cause can **never be a person**, and this Incident Review should be written to refer to the system and the context rather than the specific actors. As placeholders for names, consider using nouns like "technician", "engineer on-call", "developer", etc. -->

## Summary

<!-- _A brief summary of what happened. Try to make it as executive-friendly as possible._

_Example: For a period of 19 minutes (between 2020-05-01 12:00 UTC and 2020-05-01 12:19 UTC), GitLab.com experienced a drop in traffic to the database. 507 customers saw 2,342 503 errors over this 19-minute period. The underlying cause was determined to be a change to the PgBouncer configuration (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/XXXX) which caused the total number of connections to be reduced to 50. This incident was then mitigated by rolling back the PgBouncer configuration change._ -->

1. Service(s) affected:
1. Team attribution:
1. Time to detection:
1. Minutes downtime or degradation:

<!-- _For calculating duration of event, use the [Platform Metrics Dashboard](https://dashboards.gitlab.net/d/general-triage/general-platform-triage?orgId=1) to look at apdex and SLO violations._ -->

## Metrics

<!-- _Provide any relevant graphs that could help understand the impact of the incident and its dynamics._ -->

## Customer Impact

1. **Who was impacted by this incident? (i.e. external customers, internal customers)**
   1. ...
2. **What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)**
   1. ...
3. **How many customers were affected?**
   1. ...
4. **If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?**
   1. ...

## What were the root causes?

["5 Whys"](https://en.wikipedia.org/wiki/Five_whys)

## Incident Response Analysis

1. **How was the incident detected?**
   1. ...
1. **How could detection time be improved?**
   1. ...
1. **How was the root cause diagnosed?**
   1. ...
1. **How could time to diagnosis be improved?**
   1. ...
1. **How did we reach the point where we knew how to mitigate the impact?**
   1. ...
1. **How could time to mitigation be improved?**
   1. ...
1. **What went well?**
   1. ...

## Post Incident Analysis

1. **Did we have other events in the past with the same root cause?**
   1. ...
1. **Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?**
   1. ...
1. **Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.**
   1. ...

## Lessons Learned

<!-- _Be explicit about what lessons we learned and should carry forward. These usually inform what our corrective actions should be._

_Example:_

1. The results of refactoring activities around our integration tests should be reviewed. (i.e. we had 619 tests before the refactor but 618 after.)
2. Our tooling to dedupe alarms should have integration tests to ensure it works against existing and newly added alarms.

-->

## Guidelines

* [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html#meeting-purpose)

## Resources

1. If the **Situation Zoom room** was utilised, the recording will be automatically uploaded to the [Incident room Google Drive folder](https://drive.google.com/drive/folders/1wtGTU10-sybbCv1LiHIj2AFEbxizlcks) (private).

## Incident Review Stakeholders

<!-- "Immediately following the incident: The incident review is started in the original incident issue and the EOC and IMOC are assigned. IMOC and EOC invite stakeholders for involvement in authoring the incident review via an @ mention of their GitLab handle in the incident issue." https://about.gitlab.com/handbook/engineering/infrastructure/incident-review/#incident-review-timeline

- @ mention any additional stakeholders below. This could include engineers, engineering managers, engineering directors, quality managers and directors, product managers, technical account managers, etc.
- Use the product category page (https://about.gitlab.com/handbook/product/product-categories/) to find appropriate stakeholders and the org chart (https://about.gitlab.com/company/team/org-chart/) to find line management representation.
- Please ensure that director-level management are included on S1 incidents, and let them know that representation is mandatory.
-->

1.
1.
1.

</details>