2020-10-21: The `goserver` component of the `gitaly` service (`cny` stage) has an apdex-score burn rate outside of SLO
<!-- ISSUE TITLING: use the form "YYYY-MM-DD: briefly describe problem" -->
<!-- ISSUE LABELING: Don't forget to add labels for severity (severity::1 - severity::4) and service. If the incident relates to sensitive data or is security-related, use the label ~security and mark it confidential. -->
## Summary
<!--
Leave a brief headline remark so that people know what's going on. It's fine for
this to be vague while not much is known.
-->
The `goserver` component of the `gitaly` service (`cny` stage) has an apdex-score burn rate outside of SLO.
## Timeline
All times UTC.
2020-10-21
- 15:30 - [PagerDuty Alerted](https://gitlab.pagerduty.com/incidents/PPVVV9K)
- 15:35 - cmcfarland declares an incident in Slack using the `/incident declare` command.
- 15:35 - [PagerDuty Alert Cleared](https://gitlab.pagerduty.com/incidents/PPVVV9K)
- 15:36 - [PagerDuty Alerted](https://gitlab.pagerduty.com/incidents/PORPYTN)
- 16:00 - [PagerDuty Alert Cleared](https://gitlab.pagerduty.com/incidents/PORPYTN)
- 16:30 - Created a [silence](https://alerts.gitlab.net/#/silences/c04a848a-a854-4bdc-8e0a-5ef89bc1543c) and resolved the [PagerDuty Alert](https://gitlab.pagerduty.com/incidents/P7IIK4V)
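The silence above was created through the Alertmanager UI, but an equivalent silence can be scripted against Alertmanager's `POST /api/v2/silences` endpoint. A minimal sketch of building the request payload; the matcher label names and values here are illustrative assumptions, not the actual matchers used for this silence:

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers, duration_hours, author, comment):
    """Build the JSON payload accepted by Alertmanager's POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        # Each matcher selects alerts by an exact label match (isRegex=False).
        "matchers": [
            {"name": k, "value": v, "isRegex": False} for k, v in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

# Illustrative labels only -- the real silence targeted the gitaly apdex alert.
payload = build_silence(
    {"type": "gitaly", "component": "goserver", "stage": "cny"},
    duration_hours=2,
    author="oncall-sre",
    comment="Silencing apdex burn-rate alert while investigating the incident",
)
print(json.dumps(payload, indent=2))
```

In practice the payload would be POSTed to the Alertmanager instance (here, alerts.gitlab.net) with an authenticated HTTP client; the response contains the silence ID used to reference or expire it later.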
<!-- THE BELOW IS TO BE CONDUCTED ONCE THE ABOVE INCIDENT IS MITIGATED. TRANSFER DATA FROM THE ABOVE INTO THE INCIDENT REVIEW SECTIONS BELOW. -->
<br/>
<details>
<summary><i>Click to expand or collapse the Incident Review section.</i>
<br/>
<h2>Incident Review</h2>
</summary>
<!--
The purpose of this Incident Review is to serve as a classroom to help us better understand the root causes of an incident. Treating it as a classroom allows us to create the space to let us focus on devising the mechanisms needed to prevent a similar incident from recurring in the future. A root cause can **never be a person** and this Incident Review should be written to refer to the system and the context rather than the specific actors. As placeholders for names, consider the usage of nouns like "technician", "engineer on-call", "developer", etc..
-->
## Summary
<!--
_A brief summary of what happened. Try to make it as executive-friendly as possible._
_example: For a period of 19 minutes (between 2020-05-01 12:00 UTC and 2020-05-01 12:19 UTC), GitLab.com experienced a drop in traffic to the database. 507 customers saw 2,342 503 errors over this 19 minute period. The underlying cause has been determined to be a change to the PgBouncer configuration (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/XXXX) which caused the total number of connections to be reduced to 50. This incident was then mitigated by rolling back this PgBouncer configuration change.
-->
1. Service(s) affected:
1. Team attribution:
1. Minutes downtime or degradation:
<!--
_For calculating the duration of the event, use the [Platform Metrics Dashboard](https://dashboards.gitlab.net/d/general-triage/general-platform-triage?orgId=1) to look at apdex and SLO violations._
-->
## Metrics
<!--
_Provide any relevant graphs that could help understand the impact of the incident and its dynamics._
-->
## Customer Impact
1. Who was impacted by this incident? (i.e. external customers, internal customers)
2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
3. How many customers were affected?
4. If a precise customer impact number is unknown, what is the estimated potential impact?
## Incident Response Analysis
1. How was the event detected?
2. How could detection time be improved?
3. How did we reach the point where we knew how to mitigate the impact?
4. How could time to mitigation be improved?
## Post Incident Analysis
1. How was the root cause diagnosed?
2. How could time to diagnosis be improved?
3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
4. Was this incident triggered by a change (deployment of code or change to infrastructure. _If yes, have you linked the issue which represents the change?_)?
## 5 Whys
<!--
_This section is meant to dig into lessons learned and corrective actions. It is not limited to five whys; consider how you might dive deeper into each why._
_example:_
1. Customers experienced an inability to create new projects on GitLab.com, why?
- A code change was deployed which contained an escaped bug.
1. Why did this bug not get noticed in staging?
- The integration test for this use case is missing.
1. Why is an integration test for this use case missing?
- It was inadvertently removed during a refactoring of our test suite.
1. Why was the test suite being refactored?
- As part of our efforts to decrease MTTP.
1. Why did it take 2 hours to notice this issue in production?
   - The initial alert was suppressed as a false alarm.
1. Why was this alert suppressed?
- The system which dedupes alerts inadvertently suppressed this alarm as a duplicate.
1. Why did it take 4 hours to resolve the issue in production?
- The change which carried this escaped bug also contained a database schema change which made rolling the change back impossible. Engineering was engaged immediately by the oncall SRE and conducted a forward fix.
-->
## Lessons Learned
<!--
_Be explicit about what lessons we learned and should carry forward. These usually inform what our corrective actions should be._
_example:_
1. The results of refactoring activities around our integration tests should be reviewed. (i.e., we had 619 tests before the refactor but 618 after.)
2. Our tooling to dedupe alarms should have integration tests to ensure it works against existing and newly added alarms.
-->
## Corrective Actions
<!--
- _Use Lessons Learned as a guideline for the creation of Corrective Actions._
- _List issues that have been created as corrective actions from this incident._
- _For each issue, include the following:_
- _<Bare Issue link> - Issue labeled as ~"corrective action"._
- _Include an estimated date of completion of the corrective action._
- _Include the named individual who owns the delivery of the corrective action._
-->
## Guidelines
- [Blameless RCA Guideline](https://about.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis.html#meeting-purpose)
</details>