Brief web latency increase caused by a short-lived increase in CPU utilization on database nodes
/label incident IncidentActive
Summary
A short-lived increase in CPU utilization on database nodes slowed down transactions, which caused a brief increase in web latency.
Timeline
All times UTC.
2020-06-03
- 08:07 - brief blip across a number of services; the database appears to have been the most badly impacted
- 12:56 - mwasilewski declares incident in Slack using the /incident declare command.
Incident Review
Summary
- Service(s) affected: Postgres, Gitaly, Web
- Team attribution:
- Minutes downtime or degradation: 2 minutes of degradation
Metrics
Customer Impact
- Who was impacted by this incident? All GitLab.com users.
- What was the customer experience during the incident? Web latency was very poor.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Michal Wasilewski