2021-01-12: The server SLI of the registry service (`main` stage) has an apdex violating SLO - Google storage related

Summary

A change made to Google Storage caused an elevated number of error responses when storage was accessed. The change has been reverted.

Timeline

All times UTC.

2021-01-12

19:32 - Registry storage Apdex begins to drop
19:49 - cmcfarland declares incident in Slack
19:53 - Registry storage Apdex recovers

2021-01-14

00:36 - Registry storage Apdex begins to drop
01:05 - Registry storage Apdex recovers

Corrective Actions

https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12348

Incident Review

Summary

Service(s) affected: registry (and possibly other GCS-based services)
Team attribution: Infrastructure
Time to detection: 15m
Minutes downtime or degradation: 15m for the initial incident

Metrics

Customer Impact

Who was impacted by this incident? (i.e. external customers, internal customers)
1. Registry users, both internal and external customers.
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
1. Registry access and change requests were slow.
How many customers were affected?
1. ...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
1. Around 5000 requests were affected during the initial incident.

What were the root causes?

"5 Whys"

Incident Response Analysis

How was the incident detected?
1. PagerDuty notified the EOC of a drop in the Registry latency Apdex.
How could detection time be improved?
1. ...
How was the root cause diagnosed?
1. The storage SLI showed that storage transactions were slower than normal.
How could time to diagnosis be improved?
1. ...
How did we reach the point where we knew how to mitigate the impact?
1. ...
How could time to mitigation be improved?
1. ...
What went well?
1. ...

Post Incident Analysis

Did we have other events in the past with the same root cause?
1. ...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
1. ...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
1. ...

Lessons Learned

This incident appears to have been both caused and resolved by changes made by our provider. Our support access has made it easier to dig into provider issues. Better monitoring of our provider and good communication with them seem to be the real solution for these kinds of incidents.

Guidelines

Blameless RCA Guideline

Resources

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Incident Review Stakeholders

Edited Feb 18, 2021 by Alberto Ramos