2019-03-18 Second Spike in 5xx on web and api nodes
Summary
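Long-running Postgres queries (SELECT * without a LIMIT) coincided with Postgres timeouts and a second spike in 5xx errors on the web and api nodes. The problematic queries were removed and /explore/projects.json was temporarily blocked, after which GitLab.com returned to normal operational levels. A patch for a potential fix is in progress.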
Service(s) affected : GitLab.com
Team attribution : None
Minutes downtime or degradation : 45m
Timeline
2019-03-18 (Times in UTC)
- 19:54 - First alert received (Slack thread)
- 18:00 - starting incident
- 18:11 - observing an increase in Postgres timeouts
- 18:16 - Casey Shobe identifies long-running queries running SELECT * without LIMIT set
- 18:30 - we removed all of the problematic queries that were slowing down requests.
- 18:35 - temporarily blocked /explore/projects.json
- 18:39 - GitLab.com back to normal operational levels for the last 10 minutes.
- 18:40 - Identified a potential fix and working on a patch.
Edited by AnthonySandoval