2020-10-13 Blackbox probe for sentry.gitlab.net dropped to 0% success rate
Summary
2020-10-13 Blackbox probe success rate dropped to 0%
Sentry was down for 53 minutes due to a burst of slow queries.
A few hours later, the same thing happened a second time. Reusing this incident to keep the context together.
To be clear, these queries are very expensive regardless of disk I/O latency. The high disk latency noted below just amplifies the problem.
Sentry's disk I/O latency is poor (over 100 ms) during db backups, and these specific queries made expensive random accesses across a large working set that likely did not fit entirely in memory, which added disk I/O.
Canceling these queries recovered Sentry's availability.
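As a rough illustration of that remedy, here is a minimal sketch of selecting long-running queries and building cancel statements for them. It assumes Sentry's db is PostgreSQL (the report only says "db"), where `pg_stat_activity` and `pg_cancel_backend()` are standard. The one-minute threshold, the `ActiveQuery` type, and the helper names are hypothetical and not taken from the incident.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumption: the db is PostgreSQL. pg_stat_activity and pg_cancel_backend()
# are standard PostgreSQL features; everything else here is illustrative.
FIND_ACTIVE_QUERIES_SQL = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
"""

# "%s" is the placeholder style used by psycopg2-like DB-API drivers (assumed).
CANCEL_SQL = "SELECT pg_cancel_backend(%s)"


@dataclass
class ActiveQuery:
    """One row from FIND_ACTIVE_QUERIES_SQL (hypothetical container type)."""
    pid: int
    runtime: timedelta
    query: str


def pick_queries_to_cancel(active, min_runtime=timedelta(minutes=1)):
    """Return queries that have run at least min_runtime, longest first."""
    slow = [q for q in active if q.runtime >= min_runtime]
    return sorted(slow, key=lambda q: q.runtime, reverse=True)


def cancel_statements(queries):
    """Build (sql, params) pairs suitable for a DB-API cursor.execute()."""
    return [(CANCEL_SQL, (q.pid,)) for q in queries]


if __name__ == "__main__":
    # Fabricated sample rows, for demonstration only.
    sample = [
        ActiveQuery(101, timedelta(minutes=12), "SELECT ..."),
        ActiveQuery(102, timedelta(seconds=2), "SELECT ..."),
        ActiveQuery(103, timedelta(minutes=30), "SELECT ..."),
    ]
    for sql, params in cancel_statements(pick_queries_to_cancel(sample)):
        print(sql, params)
```

Keeping the selection logic separate from the database calls makes it possible to review the list of candidate pids before anything is canceled, which matters when capturing the queries for follow-up analysis first.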
Some potential mitigations are noted as corrective action candidates here: #2821 (comment 429206831)
Timeline
All times UTC.
1st incident
2020-10-13
- 23:42 - A burst of expensive queries is run against Sentry. Disk I/O latency is poor due to these queries and a concurrently running db backup.
- 23:44 - PagerDuty alert that sentry.gitlab.net is failing its blackbox probes: https://gitlab.pagerduty.com/incidents/POFWN58
- 23:50 - msmiley declares incident in Slack using /incident declare command.
- 00:14 - Found the host running Sentry and its db.
- 00:31 - Found that a batch of slow, concurrently running SQL queries is causing Sentry to be unresponsive and time out. Captured those queries for follow-up analysis.
- 00:35 - Sentry is back up. Recovered by killing the batch of queries that had very expensive execution plans. Monitored for a recurrence, and began post-incident analysis of the slow query pattern.
- 00:59 - Resolved incident after 25 minutes of not seeing another batch of expensive queries.
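
The 53-minute downtime in the summary can be checked against the timeline above with a small calculation. This is only a sketch: it assumes the entries after midnight (00:14 onward) fall on 2020-10-14, and the `outage_minutes` helper is a hypothetical name.

```python
from datetime import datetime


def outage_minutes(start, end, fmt="%Y-%m-%d %H:%M"):
    """Whole minutes between two UTC timestamps given as strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # 23:42 (burst of expensive queries) to 00:35 (Sentry back up).
    # Assumption: 00:35 is on the following day, 2020-10-14.
    print(outage_minutes("2020-10-13 23:42", "2020-10-14 00:35"))  # prints 53
```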
2nd incident (same behavior and remedy)
2020-10-14
- 06:10 - Start of outage. Blackbox probes start to fail. PagerDuty alert that sentry.gitlab.net is failing its blackbox probes: https://gitlab.pagerduty.com/incidents/P1VNZSS
- 06:27 - End of outage. Canceled queries, allowing Sentry to recover.
- 06:29 - Sentry's blackbox probes are back to 100% successful.
- Service(s) affected: ServiceSentry
- Team attribution: ~"team::Observability"
Corrective Actions
- #2827 (closed) (comment 429927470)
Edited by AnthonySandoval