2021-01-17 - database failover triggered by GCP snapshots
Summary
During implementation of #2648 (closed), snapshots were initiated on all database servers concurrently, under the mistaken impression that the snapshot process was a transparent background operation handled by the GCP API and would not impact the running nodes. The resulting IO latency spike on the primary triggered a Patroni failover.
Timeline
All times UTC.
2021-01-16

- 23:36 - @craig initiates snapshots on all Patroni nodes
- 23:38 - error spikes begin
- 23:39 - multiple alerts received for service disruption
- 23:40 - Patroni invalidates the leader lock held by patroni-06 and begins failover
- 23:41 - patroni-03 becomes the new primary
- 23:42 - elevated error rates fall off
- 23:47 - @dawsmith reports system degradation and the cancellation of some snapshot requests
- 23:56 - @craig declares an incident in Slack
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12355 - Update Patroni/Postgres runbooks to clarify GCP snapshot best practices.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12363 - Add a checklist for change management reviewers and approvers to the runbooks, with a reference in the handbook.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12518 - Schedule a call with Google to talk more about IO sensitivities.
Incident Review
Summary
- Service(s) affected: Patroni
- Team attribution: Reliability
- Time to detection: 1-2 minutes
- Minutes downtime or degradation: 4-5 minutes
Metrics
Metrics screenshots:

- https://gitlab.com/gitlab-com/gl-infra/production/uploads/717d15e470cb6b71217014c686640ea5/Screen_Shot_2021-01-16_at_4.48.01_PM.png
- https://gitlab.com/gitlab-com/gl-infra/production/uploads/90fc837f6035d69dcbce71bca9ac07bc/Screen_Shot_2021-01-16_at_4.47.51_PM.png
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All active users during the incident
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Some would have received 5xx errors for transactions requiring database writes
  - Page refreshes spot-checked during the incident responded normally
- How many customers were affected?
  - Unknown; see below for details on requests vs. error rates
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Full thread discussing impact: #3338 (comment 487128596)
  - Rails
    - approximately 2,000 5xx errors, with a spike roughly between 23:38:30 and 23:42:00
    - 1,521,180 requests logged, so a rough calculation gives 2019 / 1,521,180 ≈ 0.133% error rate (reproduced in the sketch below)
  - Workhorse
    - 13,757 / 481,335 ≈ 2.86% error rate
  - Aggregate
    - Zooming in on just the hour from 23:00 to 00:00 UTC, 0.78 availability points were consumed.
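For reference, a minimal sketch reproducing the rough error-rate arithmetic above; the request and error counts come straight from the figures quoted in the impact thread, and the aggregate availability-point figure is not recomputed here:

```python
# Rough error-rate arithmetic for the incident window (counts from the logs quoted above).
rails_errors = 2_019          # Rails 5xx responses, spike between ~23:38:30 and ~23:42:00
rails_requests = 1_521_180    # total Rails requests logged in the same window
workhorse_errors = 13_757
workhorse_requests = 481_335

print(f"Rails error rate:     {rails_errors / rails_requests:.3%}")          # ~0.133%
print(f"Workhorse error rate: {workhorse_errors / workhorse_requests:.3%}")  # ~2.86%
```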
What were the root causes?
- Why did we see errors across the fleet?
  - The primary Patroni node initiated a failover
- Why did Patroni initiate a failover?
  - The data disk experienced excessive IO latency
  - The node lost network connectivity to Consul at the same time; either or both prompted the failover (see the timing sketch after this list)
- Why did the node experience high IO latency / lose network connectivity?
  - A snapshot initiated on the Postgres data volume caused a massive spike in IO latency, presumably by consuming all available network bandwidth for the network-attached persistent disks
- Why did we snapshot the volume?
  - As a safety measure in preparation for a change to the related filesystem
  - The engineer planning the change was not aware of the underlying architecture of persistent disks (network-attached, consuming bandwidth shared with the VM instance), and therefore of the potential for missed heartbeats/health checks
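For background on the failover mechanics referenced above: Patroni holds a leader key in the DCS (Consul, in our case) with a TTL and refreshes it on each HA-loop iteration; if the primary cannot refresh the key in time, for example because disk IO stalls or Consul is unreachable, the key expires and the replicas hold a leader election. A minimal sketch of that timing logic, using Patroni's documented defaults rather than the actual production settings:

```python
import time

# Illustrative values; Patroni's documented defaults are ttl=30, loop_wait=10,
# retry_timeout=10. The actual production settings may differ.
TTL = 30        # seconds the leader key stays valid in Consul without a refresh
LOOP_WAIT = 10  # seconds between HA-loop iterations that refresh the key

last_refresh = time.monotonic()  # updated each time the primary renews its leader key

def leader_key_expired(now: float) -> bool:
    """True once the primary has gone longer than TTL without renewing the key.

    During this incident, snapshot-induced IO latency (plus the simultaneous loss
    of Consul connectivity) meant the key was not renewed in time, so it expired
    and patroni-03 was promoted.
    """
    return now - last_refresh > TTL
```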
Incident Response Analysis
- How was the incident detected?
  - Alert flood indicating downstream impacts
- How could detection time be improved?
  - Not really applicable in this case: the cause and effect were demonstrably and clearly linked, and we knew immediately that the single, most recent action had resulted in a severe degradation or interruption of service on the primary Patroni node.
- How was the root cause diagnosed?
  - During the incident, the demonstrable cause/effect relationship identified the act of snapshotting the disk as the trigger event
  - Subsequent examination of metrics and logs reaffirmed the causal link between snapshot and failover, though the metric data seemed somewhat contradictory
  - Later we realized/remembered that persistent disks are accessed over network bandwidth shared with the instance, which increased IO latency and ultimately caused disk access to stall on the primary, triggering the failover. (Further specifics/external references on the mechanics involved are needed, if possible.)
- How could time to diagnosis be improved?
  - Much like detection, this is not really applicable, as the cause and effect were immediately and demonstrably linked
  - That said, we did not realize during the incident itself that the impact extended beyond degradation to an actual failover. We canceled snapshots in two of the three zones when we saw signs of more serious degradation than anticipated, but only noticed the failover later, when we proceeded with the subsequent steps of the change plan
- How did we reach the point where we knew how to mitigate the impact?
  - Same as above: straightforward cause and effect
- How could time to mitigation be improved?
  - N/A
- What went well?
  - The additional visibility that comes with executing a production change on such a high-risk portion of the infrastructure meant we had many extra people watching metrics and testing the application simultaneously. Even with the causal link so evident, this dramatically shortened the time needed to confirm that no significant negative impact had been incurred and that we could safely proceed with executing the change plan.
  - This incident highlighted some areas for improvement in our change process
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - We have seen similar issues in the past where snapshots resulted in Patroni failovers
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Not at the time, no. We have since logged corrective actions to further document the underlying architectural factors and the impact of snapshots on high-IO volumes
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes; the snapshots were taken in preparation for the change tracked in #2648 (closed)
Lessons Learned
Initial points to improve:
- Don't take a GCP snapshot of a PostgreSQL leader that is up and serving traffic (see the sketch after this list).
- Technical reviewers must review every change or addition to the CR issue description before it is approved by a manager/director.
- The Datastores team should socialise in detail how backups and snapshots work for our production database.
- Wait for the scheduled start of the change (and the calendar invite) before starting any actions related to it.
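As a concrete illustration of the first point above, a minimal sketch of a pre-snapshot guard. The node, disk, and zone names are hypothetical, and it assumes `patronictl list -f json` reports cluster members with `Member` and `Role` fields; verify against the runbooks before relying on it.

```python
#!/usr/bin/env python3
"""Sketch: refuse to snapshot the data disk of the current Patroni leader."""
import json
import subprocess
import sys

NODE = "patroni-06"        # hypothetical: node whose data disk we intend to snapshot
DISK = "patroni-06-data"   # hypothetical: GCP persistent-disk name
ZONE = "us-east1-c"        # hypothetical: zone the disk lives in

def patroni_role(node: str) -> str:
    # `patronictl list -f json` prints the cluster members and their current roles.
    out = subprocess.run(
        ["patronictl", "list", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for member in json.loads(out):
        if member.get("Member") == node:
            return member.get("Role", "")
    sys.exit(f"{node} not found in the Patroni cluster")

if patroni_role(NODE).lower() == "leader":
    sys.exit(f"{NODE} is the current leader; pick a replica (or fail over first) before snapshotting.")

# Even on a replica, the snapshot competes with the persistent disk itself for the
# instance's shared network bandwidth, so watch IO latency while it runs.
subprocess.run(["gcloud", "compute", "disks", "snapshot", DISK, "--zone", ZONE], check=True)
```

Failing hard (rather than warning) keeps a guard like this safe to embed in change-plan tooling; the bandwidth caveat in the final comment is the same architectural detail called out in the root-cause analysis.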
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Incident Review Stakeholders