2021-01-17 - database failover triggered by GCP snapshots
Summary
During implementation of #2648 (closed), snapshots were initiated on all database servers concurrently, under the mistaken impression that the snapshot process was a transparent background operation handled by the GCP API and would not impact the running nodes. The resulting IO latency spike on the primary triggered a Patroni failover.
Timeline
All times UTC.
2021-01-16

- 23:36 - @craig initiates snapshots on all Patroni nodes
- 23:38 - error spikes begin
- 23:39 - multiple alerts received for service disruption
- 23:40 - Patroni invalidates the leader lock held by patroni-06 and begins failover
- 23:41 - patroni-03 becomes the new primary
- 23:42 - elevated error rates fall off
- 23:47 - @dawsmith reports system degradation and the cancellation of some snapshot requests
- 23:56 - @craig declares an incident in Slack
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12355 - Update Patroni/Postgres runbooks to clarify GCP snapshot best practices.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12363 - Add a checklist for change management reviewers and approvers to the runbooks, with a reference in the handbook.
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12518 - Schedule a call with Google to talk more about IO sensitivities.
Incident Review
Summary
- Service(s) affected: Patroni
- Team attribution: Reliability
- Time to detection: 1-2 minutes
- Minutes downtime or degradation: 4-5 minutes
Metrics
Metrics screenshots:

- https://gitlab.com/gitlab-com/gl-infra/production/uploads/717d15e470cb6b71217014c686640ea5/Screen_Shot_2021-01-16_at_4.48.01_PM.png
- https://gitlab.com/gitlab-com/gl-infra/production/uploads/90fc837f6035d69dcbce71bca9ac07bc/Screen_Shot_2021-01-16_at_4.47.51_PM.png
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All active users during the incident
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Some would have received 5xx errors for transactions requiring database writes
  - Page refreshes spot-checked during the incident responded normally
- How many customers were affected?
  - Unknown; see below for details on requests vs. error rates
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Full thread discussing impact: #3338 (comment 487128596)
  - Rails
    - approximately 2,000 5xx errors, with a spike roughly between 23:38:30 and 23:42:00
    - 1,521,180 requests logged, so a rough calculation gives 2019 / 1,521,180 ≈ 0.133% error rate (reproduced in the sketch below)
  - Workhorse
    - 13,757 / 481,335 ≈ 2.86% error rate
  - Aggregate
    - Zooming in on just the hour from 23:00 to 00:00 UTC, 0.78 availability points were consumed.
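For reference, a minimal sketch reproducing the rough error-rate arithmetic above; the request and error counts come straight from the figures quoted in the impact thread, and the aggregate availability-point figure is not recomputed here:

```python
# Rough error-rate arithmetic for the incident window (counts from the logs quoted above).
rails_errors = 2_019          # Rails 5xx responses, spike between ~23:38:30 and ~23:42:00
rails_requests = 1_521_180    # total Rails requests logged in the same window
workhorse_errors = 13_757
workhorse_requests = 481_335

print(f"Rails error rate:     {rails_errors / rails_requests:.3%}")          # ~0.133%
print(f"Workhorse error rate: {workhorse_errors / workhorse_requests:.3%}")  # ~2.86%
```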
What were the root causes?
- Why did we see errors across the fleet?
  - The primary Patroni node initiated a failover
- Why did Patroni initiate a failover?
  - The data disk experienced excessive IO latency
  - The node lost network connectivity to Consul at the same time; either or both prompted the failover (see the timing sketch after this list)
- Why did the node experience high IO latency / lose network connectivity?
  - A snapshot initiated on the Postgres data volume caused a massive spike in IO latency, presumably by consuming all available network bandwidth for the network-attached persistent disks
- Why did we snapshot the volume?
  - As a safety measure in preparation for a change to the related filesystem
  - The engineer planning the change was not aware of the underlying architecture of persistent disks (network-attached, consuming bandwidth shared with the VM instance), and therefore of the potential for missed heartbeats/health checks
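For background on the failover mechanics referenced above: Patroni holds a leader key in the DCS (Consul, in our case) with a TTL and refreshes it on each HA-loop iteration; if the primary cannot refresh the key in time, for example because disk IO stalls or Consul is unreachable, the key expires and the replicas hold a leader election. A minimal sketch of that timing logic, using Patroni's documented defaults rather than the actual production settings:

```python
import time

# Illustrative values; Patroni's documented defaults are ttl=30, loop_wait=10,
# retry_timeout=10. The actual production settings may differ.
TTL = 30        # seconds the leader key stays valid in Consul without a refresh
LOOP_WAIT = 10  # seconds between HA-loop iterations that refresh the key

last_refresh = time.monotonic()  # updated each time the primary renews its leader key

def leader_key_expired(now: float) -> bool:
    """True once the primary has gone longer than TTL without renewing the key.

    During this incident, snapshot-induced IO latency (plus the simultaneous loss
    of Consul connectivity) meant the key was not renewed in time, so it expired
    and patroni-03 was promoted.
    """
    return now - last_refresh > TTL
```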
Incident Response Analysis
- How was the incident detected?
  - Alert flood indicating downstream impacts
- How could detection time be improved?
  - Not really applicable in this case: the cause and effect were demonstrably and clearly linked, and we knew immediately that the single, most recent action had resulted in a severe degradation or interruption of service on the primary Patroni node.
- How was the root cause diagnosed?
  - During the incident, the demonstrable cause/effect relationship identified the act of snapshotting the disk as the trigger event
  - Subsequent examination of metrics and logs reaffirmed the causal link between snapshot and failover, though the metric data seemed somewhat contradictory
  - Later we realized/remembered that persistent disks are accessed over network bandwidth shared with the instance, which increased IO latency and ultimately caused disk access to stall on the primary, triggering the failover. (Further specifics/external references on the mechanics involved are needed, if possible.)
- How could time to diagnosis be improved?
  - Much like detection, this is not really applicable, as the cause and effect were immediately and demonstrably linked
  - That said, we did not realize during the incident itself that the impact extended beyond degradation to an actual failover. We canceled snapshots in two of the three zones when we saw signs of more serious degradation than anticipated, but only noticed the failover later, when we proceeded with the subsequent steps of the change plan
- How did we reach the point where we knew how to mitigate the impact?
  - Same as above: straightforward cause and effect
- How could time to mitigation be improved?
  - N/A
- What went well?
  - The additional visibility that comes with executing a production change on such a high-risk portion of the infrastructure meant we had many extra people watching metrics and testing the application simultaneously. Even with the causal link so evident, this dramatically shortened the time needed to confirm that no significant negative impact had been incurred and that we could safely proceed with executing the change plan.
  - This incident highlighted some areas for improvement in our change process
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - We have seen similar issues in the past where snapshots resulted in Patroni failovers
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Not at the time, no. We have since logged corrective actions to further document the underlying architectural factors and the impact of snapshots on high-IO volumes
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes; the snapshots were taken in preparation for the change tracked in #2648 (closed)
Lessons Learned
Initial points to improve:
- Don't take a GCP snapshot of a PostgreSQL leader that is up and serving traffic (see the sketch after this list).
- Technical reviewers must review every change or addition to the CR issue description before it is approved by a manager/director.
- The Datastores team should socialise in detail how backups and snapshots work for our production database.
- Wait for the scheduled start of the change (and the calendar invite) before starting any actions related to it.
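As a concrete illustration of the first point above, a minimal sketch of a pre-snapshot guard. The node, disk, and zone names are hypothetical, and it assumes `patronictl list -f json` reports cluster members with `Member` and `Role` fields; verify against the runbooks before relying on it.

```python
#!/usr/bin/env python3
"""Sketch: refuse to snapshot the data disk of the current Patroni leader."""
import json
import subprocess
import sys

NODE = "patroni-06"        # hypothetical: node whose data disk we intend to snapshot
DISK = "patroni-06-data"   # hypothetical: GCP persistent-disk name
ZONE = "us-east1-c"        # hypothetical: zone the disk lives in

def patroni_role(node: str) -> str:
    # `patronictl list -f json` prints the cluster members and their current roles.
    out = subprocess.run(
        ["patronictl", "list", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for member in json.loads(out):
        if member.get("Member") == node:
            return member.get("Role", "")
    sys.exit(f"{node} not found in the Patroni cluster")

if patroni_role(NODE).lower() == "leader":
    sys.exit(f"{NODE} is the current leader; pick a replica (or fail over first) before snapshotting.")

# Even on a replica, the snapshot competes with the persistent disk itself for the
# instance's shared network bandwidth, so watch IO latency while it runs.
subprocess.run(["gcloud", "compute", "disks", "snapshot", DISK, "--zone", ZONE], check=True)
```

Failing hard (rather than warning) keeps a guard like this safe to embed in change-plan tooling; the bandwidth caveat in the final comment is the same architectural detail called out in the root-cause analysis.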
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Incident Review Stakeholders