
RCA for brief gitlab.com outage on 2019-07-17

RCA for production#968 (closed).

Noteworthy events:

  • Patroni initiated the failover due to timeouts during health checks.
    • Consul and Patroni each reported intermittent timeouts in their mutual HTTP calls for a few minutes prior to the failover.
    • At 07:34:17 UTC, the Patroni agent on host patroni-01 failed its get_cluster call to Consul, meaning it could not read the cluster's shared state. This was most likely the event that triggered the failover: retry_timeout is set to 10 seconds, and per the Patroni docs that limit applies to both Consul and Postgres operations. (An illustrative sketch of this kind of cluster-state read follows this list.)
    • At 07:34:23 UTC, the Patroni agent on patroni-01 proactively announced its distress by releasing the cluster lock (but not the master role). Its Patroni peers responded by announcing how fresh their replicas were, in anticipation of electing a new leader.
    • At 07:34:27 UTC (10 seconds after get_cluster failed), Patroni sent Postgres a fast shutdown command. This aborted open transactions and cleanly halted the database, minimizing the chance of losing recently committed transactions during the upcoming failover.
    • At 07:35:10 UTC, the Patroni peers elected patroni-04 as the new leader; it had the freshest recovery point. That node assumed the master role and became writable, so clients could reconnect and return to service.
  • After the failover, host patroni-01 had difficulty rejoining the cluster as a replica.
    • At 07:36:24 UTC, pg_rewind was aborted due to statement_timeout. We should consider marking the postgres user used for replication/recovery as exempt from statement timeouts (see the sketch after the timeline).
    • At 10:53:53 UTC, host patroni-01 returned to the cluster as a replica. It had to restore from a basebackup because pg_rewind failed; that time, network, and disk overhead could have been avoided had pg_rewind succeeded.
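
For context on the 07:34:17 failure: Patroni keeps the cluster's shared state (leader key, member status, WAL positions) in Consul and reads it through the local agent's HTTP API on port 8500. The snippet below is only a rough sketch of what such a cluster-state read looks like and how a socket read timeout surfaces to the caller; it is not Patroni's actual code, and the scope name "pg-ha" and the timeout values are assumptions rather than the production configuration.

```python
# Illustrative sketch only: a get_cluster-style read of Patroni's shared state
# from the local Consul agent's KV API. The scope name "pg-ha" and the timeout
# values are assumptions, not the production configuration.
import requests

CONSUL_KV_URL = "http://127.0.0.1:8500/v1/kv/service/pg-ha/"

def fetch_cluster_state(connect_timeout=3.33, read_timeout=3.33):
    """Read the leader key, member keys, etc. in one recursive KV call."""
    try:
        resp = requests.get(
            CONSUL_KV_URL,
            params={"recurse": 1},
            timeout=(connect_timeout, read_timeout),
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        # Patroni retries such reads until retry_timeout (10 s here) is
        # exhausted; only then does it treat the DCS as unreachable and
        # release the leader lock, as patroni-01 did at 07:34:23.
        return None

if fetch_cluster_state() is None:
    print("cluster state unavailable -- cannot confirm leadership")
```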

Summary timeline:

2019-07-17

| Timestamp (UTC) | Host | Service | Event |
|---|---|---|---|
| 07:29:37 | patroni-01 | consul | Consul agent began intermittently timing out while sending pings to other remote Consul agents, unrelated to the Patroni cluster. |
| 07:34:10 | patroni-01 | consul | Consul agent failed an HTTP call to the local Patroni agent (port 8301) for check "patroni-master" due to a socket connect timeout. |
| 07:34:16 | patroni-01 | consul | Consul agent failed an HTTP call to the local Patroni agent (port 8301) for check "patroni-master" due to a socket read timeout. Same for check "patroni-replica:1". |
| 07:34:17 | patroni-01 | patroni | Patroni local agent failed its "get_cluster" HTTP call to the Consul agent (port 8500) due to a socket read timeout (3.33 s) via loopback. |
| 07:34:23 | patroni-* | patroni | Patroni peers received a message from patroni-01 that it was still alive and master but was releasing the cluster lock: cluster_unlocked: true, role: master, xlog.location: 306798768095232. |
| 07:34:23 | patroni-* | patroni | Patroni peers each announced their xlog received_location. Freshest was patroni-03 (not patroni-04), but the cluster lock remained unclaimed, so the leader race continued. |
| 07:34:23 | patroni-01 | postgres | Postgres logged statement_timeout (15 s) failures of single-row INSERT and UPDATE statements, indicating severe resource contention for at least 15 seconds. |
| 07:34:27 | patroni-01 | postgres | Postgres received the first fast shutdown request and began aborting transactions. Logged by Postgres PID 55347. Presumably the shutdown signal came from the local Patroni agent? |
| 07:34:45 | patroni-01 | postgres | Postgres final pre-shutdown checkpoint complete. |
| 07:34:50 | patroni-01 | postgres | Postgres online backup mode canceled. |
| 07:35:00 | patroni-* | patroni | Patroni peers logged "establishing a new patroni connection to the postgres cluster", due to the master Postgres stopping the "online backup mode" WAL stream. |
| 07:35:00 | patroni-* | patroni | Patroni peers received a message from patroni-01 that it was stopping Postgres. |
| 07:35:00 | patroni-* | patroni | Patroni peers each announced their xlog received_location. Freshest was now patroni-04, but the cluster lock remained unclaimed, so the leader race continued. |
| 07:35:05 | patroni-01 | postgres | Postgres database shutdown complete. |
| 07:35:10 | patroni-* | patroni | Patroni peers received a message from patroni-01 that it had stopped Postgres and was relinquishing its role as master (now role "demoted"). |
| 07:35:10 | patroni-04 | patroni | Patroni on patroni-04 acquired the cluster lock and promoted itself to master. It had the freshest data (xlog.location) when online backup mode ended. |
| 07:35:10 | patroni-01 | patroni | Patroni local agent closed its connection to Postgres (port 5432). |
| 07:35:32 | patroni-01 | patroni | Patroni local agent logged that it demoted this instance from primary. Peers logged this announcement 22 seconds earlier; why was it not logged locally until now? |
| 07:35:32 | patroni-01 | postgres | Postgres started up in recovery mode. |
| 07:35:32 | patroni-01 | postgres | Postgres reached a consistent recovery state and began accepting read-only connections. |
| 07:35:47 | patroni-01 | postgres | Postgres received the second fast shutdown request and began aborting transactions. Logged by Postgres PID 42773. Presumably Patroni wanted to use pg_rewind to join the new primary's timeline. |
| 07:35:47 | patroni-01 | postgres | Postgres second database shutdown complete. |
| 07:35:47 | patroni-01 | patroni | Patroni local agent started running pg_rewind to get back to the point where its timeline (21) diverged from the new master's (22). |
| 07:36:00 | patroni-01 | patroni | Patroni local agent found the last common checkpoint on timeline 21 with the new master (patroni-04). |
| 07:36:24 | patroni-01 | patroni | Patroni local agent failed to rewind: statement timeout while fetching files from the new master (patroni-04). It then had to restore from a basebackup. |
| 10:53:53 | patroni-01 | postgres | Postgres database started up from a freshly copied basebackup taken from a peer on the new timeline. |
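
The two 07:35:47 entries and the 07:36:00 entry concern timeline divergence: the demoted node was still on timeline 21 while the new master had forked onto timeline 22, so pg_rewind first had to locate their last common checkpoint. When both instances are running, the timeline each node's control file reports can be confirmed with pg_control_checkpoint(); the sketch below is illustrative only, and the hostnames and credentials are assumptions.

```python
# Illustrative sketch: confirm which timeline each node's control file reports.
# Hostnames, user, and dbname are assumptions, not production values.
import psycopg2

for host in ("patroni-01", "patroni-04"):
    conn = psycopg2.connect(host=host, dbname="postgres", user="postgres")
    cur = conn.cursor()
    cur.execute("SELECT timeline_id, checkpoint_lsn FROM pg_control_checkpoint()")
    timeline_id, checkpoint_lsn = cur.fetchone()
    print(f"{host}: timeline {timeline_id}, checkpoint LSN {checkpoint_lsn}")
    cur.close()
    conn.close()
```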
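
The rewind failure at 07:36:24 was statement_timeout cancelling the queries pg_rewind runs against the new master to fetch changed files, which forced the much slower basebackup path. Two possible mitigations are sketched below under an assumed role name, data directory, and hosts (production values may differ): exempt the role used for replication/recovery from statement_timeout, as suggested in the notes above, or disable the timeout for the rewind session itself via the libpq options parameter in the --source-server connection string.

```python
# Hedged sketch of two mitigations; the role name "gitlab_replicator", the data
# directory path, and the hosts are assumptions, not the production values.
import subprocess
import psycopg2

# 1. Exempt the replication/recovery role from statement_timeout so the file-
#    and block-fetch queries pg_rewind runs on the source are never cancelled.
conn = psycopg2.connect(host="patroni-04", dbname="postgres", user="postgres")
conn.autocommit = True
cur = conn.cursor()
cur.execute("ALTER ROLE gitlab_replicator SET statement_timeout TO 0")
cur.close()
conn.close()

# 2. Or disable the timeout for just the rewind session: libpq's "options"
#    keyword passes session settings through the --source-server connstring.
subprocess.run(
    [
        "pg_rewind",
        "--target-pgdata=/var/opt/gitlab/postgresql/data",
        "--source-server="
        "host=patroni-04 user=gitlab_replicator dbname=postgres "
        "options='-c statement_timeout=0'",
        "--progress",
    ],
    check=True,
)
```

Since Patroni drives pg_rewind automatically, the role-level exemption is the more direct fix; a session-level setting passed in the connection string overrides role-level defaults, so either approach alone should be sufficient.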