2020-12-08: SLI of the gitaly service (`cny` stage) has an error rate violating SLO
<!-- ISSUE TITLING: use the form "YYYY-MM-DD: briefly describe problem" -->

<!-- ISSUE LABELING: Don't forget to add labels for severity (S1 - S4) and service. If the incident relates to sensitive data or is security related, use the label ~security and mark the issue confidential. -->

## Summary

<!-- Leave a brief headline remark so that people know what's going on. It is perfectly acceptable for this to be vague while not much is known. -->

More information will be added as we investigate the issue.

## Timeline

<!-- Try to capture in this section, among other events:

- Time estimation for when the errors started - typically before the incident was declared.
- When the incident was declared.
- If other teams had to be engaged, when the right Subject Matter Expert (SME) - able to effectively work on the incident mitigation - was engaged.
- When the CMOC sent first comms for this incident.
- When the incident was mitigated.
- When the incident was fully resolved.
- A link to the original PagerDuty incident page, if any.
- ...

-->

Two spikes of error rates coming from the `cny` gitaly service:

![Screen_Shot_2020-12-08_at_10.18.20_AM](/uploads/751cddfffa3e95c8b433720a3ab760ca/Screen_Shot_2020-12-08_at_10.18.20_AM.png)

The first spike started around 16:04 and ended around 16:13; the second started around 16:59 and ended around 17:10. The errors affected `gitlab-org/gitlab`, and the failing gRPC calls were `Canceled`. CPU saturation was caused primarily by pack-objects and upload-pack.

All times UTC.

- 2020-12-08 16:33 - cindy declares incident in Slack.
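For context on what "error rate violating SLO" means in the title, the check can be sketched as a small calculation: the error-rate SLI is the fraction of failed requests over a measurement window, and it violates the SLO when it exceeds the allowed error rate. This is a minimal illustration only; the request counts and the 0.1% threshold below are hypothetical, not the actual gitaly `cny`-stage SLO.

```python
# Illustrative sketch of an error-rate SLI vs. SLO check.
# All numbers, including the 0.1% threshold, are hypothetical.

def error_rate_sli(failed: int, total: int) -> float:
    """Fraction of requests that failed over the measurement window."""
    return failed / total if total else 0.0

def violates_slo(failed: int, total: int, max_error_rate: float = 0.001) -> bool:
    """True when the observed error rate exceeds the allowed error rate."""
    return error_rate_sli(failed, total) > max_error_rate

# Hypothetical counts for one spike: 120 Canceled gRPC calls
# out of 50,000 requests is a 0.24% error rate, above a 0.1% SLO.
print(violates_slo(failed=120, total=50_000))
```

In practice these ratios come from the service's Prometheus metrics rather than raw counts, but the comparison against the SLO threshold is the same.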
## Corrective Actions

<!--
- _List issues that have been created as corrective actions from this incident._
- _For each issue, include the following:_
  - _<Bare Issue link> - Issue labeled as ~"corrective action"._
  - _Include an estimated date of completion of the corrective action._
  - _Include the named individual who owns the delivery of the corrective action._
- _If an incident review was completed, use Lessons Learned as a guideline for creation of Corrective Actions._
-->

----

<!-- THE BELOW IS TO BE CONDUCTED ONCE THE ABOVE INCIDENT IS MITIGATED. TRANSFER DATA FROM THE ABOVE INTO THE INCIDENT REVIEW SECTIONS BELOW. -->

<br/>

<details>
<summary><i>Click to expand or collapse the Incident Review section.</i>
<br/>

# Incident Review

</summary>

----

<!-- The purpose of this Incident Review is to serve as a classroom to help us better understand the root causes of an incident. Treating it as a classroom allows us to create the space to focus on devising the mechanisms needed to prevent a similar incident from recurring in the future.

A root cause can **never be a person**, and this Incident Review should be written to refer to the system and the context rather than the specific actors. As placeholders for names, consider using nouns like "technician", "engineer on-call", "developer", etc. -->

## Summary

<!-- _A brief summary of what happened. Try to make it as executive-friendly as possible._

_Example: For a period of 19 minutes (between 2020-05-01 12:00 UTC and 2020-05-01 12:19 UTC), GitLab.com experienced a drop in traffic to the database. 507 customers saw 2,342 503 errors over this 19-minute period. The underlying cause was determined to be a change to the PgBouncer configuration (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/XXXX) which caused the total number of connections to be reduced to 50. This incident was then mitigated by rolling back the PgBouncer configuration change._ -->

1. Service(s) affected:
1. Team attribution:
1. Time to detection:
1. Minutes downtime or degradation:

<!-- _For calculating duration of event, use the [Platform Metrics Dashboard](https://dashboards.gitlab.net/d/general-triage/general-platform-triage?orgId=1) to look at apdex and SLO violations._ -->

## Metrics

<!-- _Provide any relevant graphs that could help understand the impact of the incident and its dynamics._ -->

## Customer Impact

1. **Who was impacted by this incident? (i.e. external customers, internal customers)**
   1. ...
2. **What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)**
   1. ...
3. **How many customers were affected?**
   1. ...
4. **If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?**
   1. ...

## What were the root causes?

["5 Whys"](https://en.wikipedia.org/wiki/Five_whys)

## Incident Response Analysis

1. **How was the incident detected?**
   1. ...
1. **How could detection time be improved?**
   1. ...
1. **How was the root cause diagnosed?**
   1. ...
1. **How could time to diagnosis be improved?**
   1. ...
1. **How did we reach the point where we knew how to mitigate the impact?**
   1. ...
1. **How could time to mitigation be improved?**
   1. ...
1. **What went well?**
   1. ...

## Post Incident Analysis

1. **Did we have other events in the past with the same root cause?**
   1. ...
1. **Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?**
   1. ...
1. **Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.**
   1. ...

## Lessons Learned

<!-- _Be explicit about what lessons we learned and should carry forward. These usually inform what our corrective actions should be._

_Example:_

1. The results of refactoring activities around our integration tests should be reviewed. (i.e. we had 619 tests before the refactor but 618 after.)
2. Our tooling to dedupe alarms should have integration tests to ensure it works against existing and newly added alarms.

-->

## Guidelines

* [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html#meeting-purpose)

## Resources

1. If the **Situation Zoom room** was utilised, the recording will be automatically uploaded to the [Incident room Google Drive folder](https://drive.google.com/drive/folders/1wtGTU10-sybbCv1LiHIj2AFEbxizlcks) (private).

## Incident Review Stakeholders

<!-- "Immediately following the incident: The incident review is started in the original incident issue and the EOC and IMOC are assigned. IMOC and EOC invite stakeholders for involvement in authoring the incident review via an @ mention of their GitLab handle in the incident issue." https://about.gitlab.com/handbook/engineering/infrastructure/incident-review/#incident-review-timeline

- @ mention any additional stakeholders below. This could include engineers, engineering managers, engineering directors, quality managers and directors, product managers, technical account managers, etc.
- Use the product category page (https://about.gitlab.com/handbook/product/product-categories/) to find appropriate stakeholders and the org chart (https://about.gitlab.com/company/team/org-chart/) to find line management representation.
- Please ensure that director-level management are included on S1 incidents, and let them know that representation is mandatory.
-->

1.
1.
1.

</details>