2020-10-29: Database failover
Summary
We continue to see degraded performance across services, but CI jobs are the most severely impacted. The underlying cause is the database failover and an increase in load across the database cluster. We're identifying the best way to mitigate.
Timeline
All times UTC.
2020-10-29
- 21:00 - 21:15 - Incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2936 is being mitigated concurrently, causing unavailability of app metrics
- 21:00 - Primary `patroni-01` reboots due to a "host error" according to GCP's event log and initiates a DB failover to `patroni-04`, triggering an influx of apdex SLO alerts for several components
- Postgres processes start causing CPU saturation
- 21:20 - Matt S. finds and shares that the patroni host rebooted and there was a failover
- The `pg_stat_activity` view shows multiple long-running queries causing CPU saturation. Unsure whether they're recently added slow queries, existing slow queries running more often, or queries with a bad execution plan
- Team attempts to find the endpoints causing the long-running queries and to potentially block them, to allow the database to catch up (see the SQL sketch after the timeline)
- Matt runs `ANALYZE` to refresh statistics and fix some poor execution plans
- 22:53 - Apdex for API and runners starts gradually returning to normal levels
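The queries below are a minimal sketch of the kind of checks used during this phase: first spotting long-running statements in `pg_stat_activity`, then refreshing statistics with `ANALYZE` on a table that is getting a bad plan. The table name is a placeholder; the actual tables analyzed during the incident are not listed in this issue.

```sql
-- List statements that have been running for more than a minute,
-- longest-running first.
SELECT pid,
       now() - query_start AS duration,
       state,
       left(query, 120)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;

-- Refresh optimizer statistics for a table that is producing bad plans.
-- "ci_builds" is a placeholder table name, not necessarily one involved
-- in this incident.
ANALYZE VERBOSE ci_builds;
```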
Incident Review
Summary
- Service(s) affected: Postgres
- Team attribution: Datastores
- Minutes downtime or degradation: A spike of high error rates for several components in the first 10 minutes of the failover, with degraded CI service for ~2 hours.
Metrics
Customer Impact
- Who was impacted by this incident? Any customer using affected API endpoints and runners.
- What was the customer experience during the incident? Slowness, 503s, CI jobs not executing.
- How many customers were affected? All customers using API and runners.
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Influx of alerts, Matt sharing the failover data
- How could detection time be improved? Adding alerts to notify when a primary is down or a failover has happened (a minimal check that could back such an alert is sketched after this list)
- How did we reach the point where we knew how to mitigate the impact? Analyzing queries that had a bad execution plan.
- How could time to mitigation be improved? A runbook or script to refresh statistics
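As a rough illustration of the detection gap, a failover alert could be driven by a role check on each node. The alerting pipeline to use (Prometheus rules, exporter metrics, Patroni callbacks) is not decided in this issue, so the SQL below only sketches the underlying signal.

```sql
-- Returns false on a node accepting writes (primary) and true on a replica.
-- A node whose answer changes is a strong signal that a failover happened.
SELECT pg_is_in_recovery();

-- On the primary, confirm replicas are attached and check how far behind they are.
SELECT application_name, state, replay_lag
FROM pg_stat_replication;
```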
Post Incident Analysis
- How was the root cause diagnosed? Root cause and suggestions here
- How could time to diagnosis be improved? A runbook to help diagnose
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? No.
5 Whys
- Why didn’t we know a DB failover had happened?
- We had to work it out from other signals, since we don't have alerting for when a database failover happens
- Why did GCP reboot `patroni-01`?
- The instance ran into a "host" error.
- Why did we re-use an incident issue?
- We already had 2 incidents open
- CMOC and channel were already active
- Not fully used to using Woodhouse yet; that will change as we get used to it
- Why did the automated `ANALYZE` take until 2020-10-30 01:33 UTC to complete?
- Why does it take us a while to do analysis by hand?
- No quick checklist/commands for an EOC to “check” health of cluster (a possible starting point is sketched after this list)
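A quick cluster health checklist for an EOC might start with something like the following. This is an assumed sketch of what such a runbook could contain, not the contents of an existing one.

```sql
-- 1. What role does this node think it has right now?
SELECT pg_is_in_recovery();

-- 2. Is work piling up? Count sessions by state and find the oldest active query.
SELECT state, count(*) AS sessions, max(now() - query_start) AS oldest_query
FROM pg_stat_activity
GROUP BY state;

-- 3. Are statistics likely stale? Look for tables with many dead tuples
--    and no recent (auto)analyze.
SELECT relname, n_live_tup, n_dead_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```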
Lessons Learned
- We saw some new behaviors when the primary database is unavailable and a failover happens
Corrective Actions
- Create a runbook for establishing health of database
- Alert on database failover
- (per #2937 (comment 445840087) @msmiley) Reconfigure the `db-analyze.sh` script to use the normal statistics target. This avoids the risk that led to this incident, but it still does a lot of unnecessary work and may not be fast enough to avoid the spurious alert that the script was created to mitigate. --> https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4555. CA here: #3021 (closed).
- (per #2937 (comment 445840087) @msmiley) Replace `db-analyze.sh` with a new post-failover action that runs `VACUUM` instead of `ANALYZE`. Either of those commands will update the "stats collector" counters for live and dead tuples (see here and here), but running just `VACUUM` avoids modifying the optimizer statistics and also gives a head start on cleaning up the dead tuples (which is the main reason for tracking these counts). CA here: #3024 (closed). Both options are sketched below.
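A minimal SQL sketch of the two options above, using a placeholder table name (`ci_builds` is illustrative only); the real `db-analyze.sh` behavior is defined in the linked change, not here.

```sql
-- Option A (reconfigured db-analyze.sh): ANALYZE with the normal statistics
-- target. Updates optimizer statistics and, as a side effect, the stats
-- collector's live/dead tuple counters.
SET default_statistics_target = 100;   -- 100 is the Postgres default
ANALYZE VERBOSE ci_builds;             -- placeholder table name

-- Option B (proposed replacement): VACUUM updates the live/dead tuple
-- counters without touching optimizer statistics, and starts cleaning up
-- dead tuples immediately.
VACUUM VERBOSE ci_builds;              -- placeholder table name
```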