2020-10-29: Database failover
Summary
We continue to see degraded performance across services, but CI jobs are the most severely impacted. The underlying cause is the database failover and an increase in load across the database cluster. We're identifying the best way to mitigate.
Timeline
All times UTC.
2020-10-29
- 21:00 - 21:15 - Incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2936 is being mitigated concurrently, causing unavailability of app metrics
- 21:00 - Primary `patroni-01` reboots due to a "host error" according to GCP's event log and initiates a DB failover to `patroni-04`, triggering an influx of apdex SLO alerts for several components
- Postgres processes start causing CPU saturation
- 21:20 - Matt S. finds and shares that the patroni host rebooted and there was a failover
- The `pg_stat_activity` view shows multiple long-running queries causing CPU saturation. Unsure whether they're recently added slow queries, existing slow queries running more often, or queries with a bad execution plan
- Team attempts to find the endpoints causing the long-running queries and to potentially block them, to allow the database to catch up (see the SQL sketch after the timeline)
- Matt runs `ANALYZE` to refresh statistics and fix some poor execution plans
- 22:53 - Apdex for API and runners starts gradually returning to normal levels
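The queries below are a minimal sketch of the kind of checks used during this phase: first spotting long-running statements in `pg_stat_activity`, then refreshing statistics with `ANALYZE` on a table that is getting a bad plan. The table name is a placeholder; the actual tables analyzed during the incident are not listed in this issue.

```sql
-- List statements that have been running for more than a minute,
-- longest-running first.
SELECT pid,
       now() - query_start AS duration,
       state,
       left(query, 120)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;

-- Refresh optimizer statistics for a table that is producing bad plans.
-- "ci_builds" is a placeholder table name, not necessarily one involved
-- in this incident.
ANALYZE VERBOSE ci_builds;
```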
Incident Review
Summary
- Service(s) affected: Postgres
- Team attribution: Datastores
- Minutes downtime or degradation: A spike of high error rates for several components in the first 10 minutes of the failover, with degraded CI service for ~2 hours.
Metrics
Customer Impact
- Who was impacted by this incident? Any customer using affected API endpoints and runners.
- What was the customer experience during the incident? Slowness, 503s, CI jobs not executing.
- How many customers were affected? All customers using API and runners.
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Influx of alerts, Matt sharing the failover data
- How could detection time be improved? Adding alerts to notify when a primary is down or a failover has happened (a minimal check that could back such an alert is sketched after this list)
- How did we reach the point where we knew how to mitigate the impact? Analyzing queries that had a bad execution plan.
- How could time to mitigation be improved? A runbook or script to refresh statistics
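As a rough illustration of the detection gap, a failover alert could be driven by a role check on each node. The alerting pipeline to use (Prometheus rules, exporter metrics, Patroni callbacks) is not decided in this issue, so the SQL below only sketches the underlying signal.

```sql
-- Returns false on a node accepting writes (primary) and true on a replica.
-- A node whose answer changes is a strong signal that a failover happened.
SELECT pg_is_in_recovery();

-- On the primary, confirm replicas are attached and check how far behind they are.
SELECT application_name, state, replay_lag
FROM pg_stat_replication;
```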
Post Incident Analysis
- How was the root cause diagnosed? Root cause and suggestions here
- How could time to diagnosis be improved? A runbook to help diagnose
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? No.
5 Whys
- Why didn’t we know a DB failover had happened?
- We had to work it out from other signals, since we don't have alerting for when a database failover happens
- Why did GCP reboot `patroni-01`?
- The instance ran into a "host" error.
- Why did we re-use an incident issue?
- We already had 2 incidents open
- CMOC and channel were already active
- Not fully used to using Woodhouse yet; that will change as we get used to it
- Why did the automated `ANALYZE` take until 2020-10-30 01:33 UTC to complete?
- Why does it take us a while to do analysis by hand?
- No quick checklist/commands for an EOC to “check” health of cluster (a possible starting point is sketched after this list)
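A quick cluster health checklist for an EOC might start with something like the following. This is an assumed sketch of what such a runbook could contain, not the contents of an existing one.

```sql
-- 1. What role does this node think it has right now?
SELECT pg_is_in_recovery();

-- 2. Is work piling up? Count sessions by state and find the oldest active query.
SELECT state, count(*) AS sessions, max(now() - query_start) AS oldest_query
FROM pg_stat_activity
GROUP BY state;

-- 3. Are statistics likely stale? Look for tables with many dead tuples
--    and no recent (auto)analyze.
SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```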
Lessons Learned
- We saw some new behaviors when the primary database is unavailable and a failover happens
Corrective Actions
- Create a runbook for establishing health of database
- Alert on database failover
- (per #2937 (comment 445840087) @msmiley) Reconfigure the `db-analyze.sh` script to use the normal statistics target. This avoids the risk that led to this incident, but it still does a lot of unnecessary work and may not be fast enough to avoid the spurious alert that the script was created to mitigate. --> https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4555. CA here: #3021 (closed).
- (per #2937 (comment 445840087) @msmiley) Replace `db-analyze.sh` with a new post-failover action that runs `VACUUM` instead of `ANALYZE`. Either of those commands will update the "stats collector" counters for live and dead tuples (see here and here), but running just `VACUUM` avoids modifying the optimizer statistics and also gives a head start on cleaning up the dead tuples (which is the main reason for tracking these counts). CA here: #3024 (closed). Both options are sketched below.
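A minimal SQL sketch of the two options above, using a placeholder table name (`ci_builds` is illustrative only); the real `db-analyze.sh` behavior is defined in the linked change, not here.

```sql
-- Option A (reconfigured db-analyze.sh): ANALYZE with the normal statistics
-- target. Updates optimizer statistics and, as a side effect, the stats
-- collector's live/dead tuple counters.
SET default_statistics_target = 100;   -- 100 is the Postgres default
ANALYZE VERBOSE ci_builds;             -- placeholder table name

-- Option B (proposed replacement): VACUUM updates the live/dead tuple
-- counters without touching optimizer statistics, and starts cleaning up
-- dead tuples immediately.
VACUUM VERBOSE ci_builds;              -- placeholder table name
```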