2020-11-18: The `workhorse` component of the `api` service, (`cny` stage), has an apdex-score burn rate outside of SLO
Summary
Context will be added here as we investigate.
Timeline
All times UTC.
2020-11-18
- 18:10 - Alert triggered: The `workhorse` component of the `api` service, (`cny` stage), has an apdex-score burn rate outside of SLO: https://gitlab.pagerduty.com/incidents/PBWEXYK
- 18:10 - Mistakenly related the triggered alert to a previously declared incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3051
- 18:30 - @skarbek notes that the alert is expected, because canary is being drained in response to another incident (Select2 images not loading CSS properly for license compliance and GitHub project import: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3052)
- 18:32 - @nnelson declares incident in Slack.
- 18:40 - Alert resolved: The `workhorse` component of the `api` service, (`cny` stage), has an apdex-score burn rate outside of SLO
- 18:47 - Alert triggered: `component_apdex_ratio_burn_rate_slo_out_of_bounds_lower`: https://gitlab.pagerduty.com/incidents/PISKC9W
- 19:12 - Alert resolved: `component_apdex_ratio_burn_rate_slo_out_of_bounds_lower`
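For context on the alert class in this timeline: an apdex score is, roughly, the fraction of requests served within latency thresholds, and a burn-rate alert fires when the observed error (1 − apdex) consumes the SLO error budget faster than the target allows. A minimal sketch of the idea, with hypothetical numbers (the actual `workhorse` thresholds and SLO target are not taken from this incident):

```python
def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Classic Apdex: satisfied requests count fully, tolerating count half."""
    if total == 0:
        return 1.0
    return (satisfied + tolerating / 2) / total


def burn_rate(apdex_score: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 means the budget is spent exactly at the rate the
    SLO allows over the window; above 1.0 it will be exhausted early.
    """
    allowed_error = 1.0 - slo_target       # error budget per unit of traffic
    observed_error = 1.0 - apdex_score
    return observed_error / allowed_error if allowed_error else float("inf")


# Hypothetical example: SLO target 0.995, mostly-satisfied traffic.
rate = burn_rate(apdex(satisfied=9000, tolerating=1000, total=10500), 0.995)
```

With canary drained, the denominator of the apdex ratio shrinks toward zero, which is why a low-traffic stage can trip burn-rate alerts that do not reflect real user impact.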
Corrective Actions
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Time to detection:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes? ("5 Whys")
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
Guidelines
Edited by Nels Nelson