2021-01-12: The server SLI of the registry service (`main` stage) has an apdex violating SLO - Google storage related
Summary
A change made to Google Storage caused an elevated number of error responses when storage was accessed. The change has been reverted.
Timeline
All times UTC.
2021-01-12
- 19:32 - Registry storage Apdex begins to drop
- 19:49 - cmcfarland declares incident in Slack
- 19:53 - Registry storage Apdex recovers
2021-01-14
- 00:36 - Registry storage Apdex begins to drop
- 01:05 - Registry storage Apdex recovers
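For readers unfamiliar with the metric tracked in the timeline, the following is a minimal sketch of the standard Apdex formula, (satisfied + tolerating / 2) / total. The function name, the thresholds, and the latency samples are hypothetical illustrations; the registry's actual SLI definition and thresholds are not stated in this report.

```python
def apdex(latencies_s, satisfied_s, tolerated_s):
    """Standard Apdex score: (satisfied + tolerating / 2) / total.

    latencies_s: request durations in seconds
    satisfied_s: requests at or below this duration count as satisfied
    tolerated_s: requests above satisfied_s but at or below this count as tolerating
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for t in latencies_s if t <= satisfied_s)
    tolerating = sum(1 for t in latencies_s if satisfied_s < t <= tolerated_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


# Hypothetical samples and thresholds, NOT data from this incident.
healthy = [0.1, 0.2, 0.3, 0.4]
degraded = [0.1, 0.8, 3.0, 5.0]

print(apdex(healthy, satisfied_s=0.5, tolerated_s=2.0))   # 1.0
print(apdex(degraded, satisfied_s=0.5, tolerated_s=2.0))  # 0.375
```

Slower storage transactions move requests out of the satisfied bucket, so the score drops even when no request fails outright.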
Corrective Actions
Incident Review
Summary
- Service(s) affected: registry (and possibly other GCS-based services)
- Team attribution: Infrastructure
- Time to detection: 15m
- Minutes downtime or degradation: 15m for the initial incident
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Registry users, both internal and external customers.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Slow registry access and change requests.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Around 5000 requests for the initial incident.
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - PagerDuty notified the EOC of a drop in the Registry latency Apdex.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - The storage SLI showed that storage transactions were slower than normal.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
This incident appears to have been both caused and resolved by changes made by our provider. Our support access has made it easier to dig into provider issues. Monitoring our provider more closely and maintaining good communication with them seem to be the real solution for these kinds of incidents.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Incident Review Stakeholders
Edited by Alberto Ramos