2019-08-14: increased latencies caused by a database failover
Summary
After a database failover, which we suspect was caused by a network issue, we observed an uneven distribution of database traffic across the read replicas, which caused increased latencies across GitLab.com.
Downtime minutes: 1
Degradation minutes: ~362 (from 2019-08-14 08:28 UTC to about 14:30 UTC).
RCA: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7543
Timeline
2019-08-14
- 08:25 UTC - The Patroni PostgreSQL cluster manager on the primary database instance (pg01) reports "ERROR: get_cluster"
- 08:27 UTC - Patroni initiates a failover, choosing pg04 as the new master. Site-wide downtime begins.
- 08:28 UTC - the downtime alerts the on-call engineer. The failover completes and the downtime ends.
- 08:28 UTC - the old master (pg01) is unable to rejoin as a read replica because a statement timeout aborts `pg_rewind` (see the `pg_rewind` sketch after the timeline).
- 08:28 UTC - one read replica (pg06) begins to receive a disproportionate share of traffic, causing client connections to queue (see the connection-count sketch below). This causes widespread performance degradation that is still being investigated (as of 13:09 UTC).
- 13:40 UTC - the old master (pg01) successfully rejoins the cluster as a replica after manual intervention (see the `patronictl` sketch below; a full write-up will follow). This reduces the queued client connections on pg06, improves latency (not quite back to normal levels), and returns the error rate to normal levels.
- 14:00 UTC - the client connections queued on pg06 begin to decline steadily; latency eventually returns to normal levels.
- 15:50 UTC - the new node (pg07) finishes bootstrapping and joins the cluster.
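
For context on the failed rejoin at 08:28 UTC: `pg_rewind` runs queries against the source server, so a low `statement_timeout` there can abort it. Below is a minimal sketch of rewinding the old primary so it can follow the new one; the hostnames, data directory, and connection options are assumptions, not the exact commands we ran.

```shell
# Hypothetical sketch: rewind pg01's data directory so it can rejoin as a
# replica of the new primary (pg04). Paths, host, and user are assumptions.
# statement_timeout=0 is passed so the source server does not cancel the
# long-running queries pg_rewind issues against it.
sudo -u postgres pg_rewind \
  --target-pgdata=/var/opt/gitlab/postgresql/data \
  --source-server="host=pg04 port=5432 user=postgres dbname=postgres options='-c statement_timeout=0'" \
  --progress
```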
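
The disproportionate traffic on pg06 can be spotted by comparing client connection counts across replicas. A check along these lines (the exact invocation is illustrative; `gitlab-psql` is the Omnibus `psql` wrapper) shows how many backends each replica is serving and in what state:

```shell
# Hypothetical check: count connections per state on a replica (e.g. pg06).
# A backend count far above the other replicas indicates the uneven
# traffic distribution seen during this incident.
sudo gitlab-psql -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"
```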
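
The manual intervention at 13:40 UTC will be written up separately. As a hedged illustration only, one common way to re-add a failed member with Patroni is to reinitialize it from the current primary; the cluster name and config path below are assumptions:

```shell
# Hypothetical sketch: re-initialize pg01 from the current primary using
# patronictl. This wipes pg01's data directory and re-bootstraps it as a
# replica. Cluster name and config path are assumptions.
patronictl -c /etc/patroni/patroni.yml reinit pg-cluster pg01

# Verify cluster membership and replication state afterwards.
patronictl -c /etc/patroni/patroni.yml list pg-cluster
```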