Database instance GCP maintenance events

Problem

During incident production#7212 (comment 978688703) we observed a big increase in IO latency in one database instance for a short duration ~1m, this however had a direct impact on SLIs and caused some errors.

In this GCP case we learned that all events originated from instance live migrations due to host maintenance events.

Solution

As a solution, we need to either handle or avoid live migrations for DB instances, there are a few options:

  1. Update maintenance policy to disable migration == instance is terminated
  2. Query for live migration notices and act on it

Best approach seems to be 2., the notice gives us 60 seconds to do something about it, for example adding a temporary noloadbalance tag to the Patroni instance.

A GCP Watchdog script exists and is running in the nodes, however there aren't any hooks configured (/etc/gcp_watchdog/hooks/*).

Caveat

There are some recurring cost implications on this solution as we might have to replace all N2 nodes for the newer C3 machines, as apparently only C3 VM class supports host maintenance event notification - https://cloud.google.com/compute/docs/instances/monitor-plan-host-maintenance-event#limitations

Currently only our GPRD patroni-main cluster is running over C3, all other clusters in GPRD and GSTG are deployed over N2.

Edited by Rafael Henchen