2020-10-22: The `mainHttpServices` component of the `frontend` service (`main` stage) has an apdex-score burn rate outside of SLO
Summary
The `mainHttpServices` component of the `frontend` service (`main` stage) has an apdex-score burn rate outside of SLO.
Timeline
All times UTC.
2020-10-22
- 17:27 - GPRD Deploy Started
- 18:44 - Frontend Apdex SLO drops below 6h threshold
- 18:46 - Frontend Apdex SLO drops below 1h threshold
- 19:02 - PagerDuty Alerted
- 19:04 - cmcfarland declares incident in Slack using the `/incident declare` command
- 19:12 - Frontend Apdex SLO recovers above 1h threshold
- 19:13 - Frontend Apdex SLO recovers above 6h threshold
- 19:17 - PagerDuty Alert Cleared
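
The 1h and 6h thresholds above come from multiwindow burn-rate alerting: a page fires only once the error budget is burning too fast over both a short and a long window, so a brief blip or a slow recovery does not page on its own. As a minimal sketch (not the production alert definition; the apdex formula is the classic one, but the 99.5% SLO target and the factor are illustrative assumptions):

```python
# Illustrative multiwindow apdex burn-rate check; the SLO target and
# burn factor here are assumptions, not the production values.
SLO_TARGET = 0.995

def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Classic apdex: satisfied requests count fully, tolerating ones half."""
    return (satisfied + tolerating / 2) / total

def burn_rate(apdex_score: float) -> float:
    """How fast the error budget is being consumed relative to the SLO."""
    return (1 - apdex_score) / (1 - SLO_TARGET)

def should_page(apdex_1h: float, apdex_6h: float, factor: float = 6.0) -> bool:
    # Require BOTH windows to burn too fast before paging, matching the
    # timeline above where the page followed both threshold crossings.
    return burn_rate(apdex_1h) > factor and burn_rate(apdex_6h) > factor
```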
Incident Review
Summary
During the production deploy, several new instances of the https-git service in Kubernetes were deployed and older ones were removed. As in the earlier incident, we saw an increase in web slowness; unlike then, changes had been made to keep more nodes up during the deploy (a sketch of this kind of change follows the list below). This still did not completely solve the issue, and we again saw slow web responses for https-git traffic.
- Service(s) affected: https-git traffic
- Team attribution: Delivery
- Minutes downtime or degradation: 29 minutes
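
One common way to "keep more nodes up during the deploy" is to widen the Deployment's rolling-update surge so replacement pods come up before old ones drain. A minimal sketch using the official Kubernetes Python client; the deployment name, namespace, and surge values here are hypothetical, not the actual change that was made:

```python
# Sketch: keep capacity up during a rolling deploy by surging new pods in
# before old ones are removed. Name, namespace, and values are hypothetical.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {
                "maxSurge": "50%",       # bring extra pods up first
                "maxUnavailable": "0%",  # never dip below desired capacity
            },
        }
    }
}
apps.patch_namespaced_deployment(
    name="https-git",    # hypothetical workload name
    namespace="gitlab",  # hypothetical namespace
    body=patch,
)
```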
Metrics
Customer Impact
- Who was impacted by this incident? Any user of https-git service during the incident
- What was the customer experience during the incident? Slow Git requests/transactions
- How many customers were affected? Unknown
- If a precise customer impact number is unknown, what is the estimated potential impact? During the incident, there was a marked increase in workhorse-to-git transactions taking longer than 0.5 seconds: 439,688 of 1,582,267 transactions in that window exceeded 0.5 seconds, so roughly 28% of https-git users saw a slower-than-normal transaction (a quick check follows below).
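
The 28% figure follows directly from the transaction counts above:

```python
slow = 439_688      # workhorse-to-git transactions over 0.5 s in the window
total = 1_582_267   # all workhorse-to-git transactions in the same window
print(f"{slow / total:.1%}")  # 27.8%, reported above as roughly 28%
```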
Incident Response Analysis
- How was the event detected? PagerDuty alerted on a slow frontend apdex
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact? N/A
- How could time to mitigation be improved? Due to the nature of this problem, the only real option is to make a change and see whether the next deploy shows similar issues.
Post Incident Analysis
- How was the root cause diagnosed? Jarv was available and saw the pattern before the EOC and IMOC did
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?