
RCA for brief gitlab.com outage on 2019-07-17

RCA for production#968 (closed).

Noteworthy events:

  • Patroni initiated the failover due to timeouts during health checks.
    • Consul and Patroni each reported intermittent timeouts in their mutual HTTP calls for a few minutes prior to the failover.
    • At 07:34:17 UTC, the Patroni agent on host patroni-01 failed its get_cluster call to Consul, meaning it could not read the cluster's shared state. This was most likely the event that triggered the failover: retry_timeout is set to 10 seconds, and per the Patroni docs that limit applies to both Consul and Postgres operations. (An illustrative sketch of this kind of cluster-state read follows this list.)
    • At 07:34:23 UTC, the Patroni agent on patroni-01 proactively announced its distress by releasing the cluster lock (but not the master role). Its Patroni peers responded by announcing how fresh their replicas were, in anticipation of electing a new leader.
    • At 07:34:27 UTC (10 seconds after get_cluster failed), Patroni sent Postgres a fast shutdown command. This aborted open transactions and cleanly halted the database, minimizing the chance of losing recently committed transactions during the upcoming failover.
    • At 07:35:10 UTC, the Patroni peers elected patroni-04 as the new leader; it had the freshest recovery point. That node assumed the master role and became writable, so clients could reconnect and return to service.
  • After the failover, host patroni-01 had difficulty rejoining the cluster as a replica.
    • At 07:36:24 UTC, pg_rewind was aborted due to statement_timeout. We should consider marking the postgres user used for replication/recovery as exempt from statement timeouts (see the sketch after the timeline).
    • At 10:53:53 UTC, host patroni-01 returned to the cluster as a replica. It had to restore from a basebackup because pg_rewind failed; that time, network, and disk overhead could have been avoided had pg_rewind succeeded.
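
For context on the 07:34:17 failure: Patroni keeps the cluster's shared state (leader key, member status, WAL positions) in Consul and reads it through the local agent's HTTP API on port 8500. The snippet below is only a rough sketch of what such a cluster-state read looks like and how a socket read timeout surfaces to the caller; it is not Patroni's actual code, and the scope name "pg-ha" and the timeout values are assumptions rather than the production configuration.

```python
# Illustrative sketch only: a get_cluster-style read of Patroni's shared state
# from the local Consul agent's KV API. The scope name "pg-ha" and the timeout
# values are assumptions, not the production configuration.
import requests

CONSUL_KV_URL = "http://127.0.0.1:8500/v1/kv/service/pg-ha/"

def fetch_cluster_state(connect_timeout=3.33, read_timeout=3.33):
    """Read the leader key, member keys, etc. in one recursive KV call."""
    try:
        resp = requests.get(
            CONSUL_KV_URL,
            params={"recurse": 1},
            timeout=(connect_timeout, read_timeout),
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        # Patroni retries such reads until retry_timeout (10 s here) is
        # exhausted; only then does it treat the DCS as unreachable and
        # release the leader lock, as patroni-01 did at 07:34:23.
        return None

if fetch_cluster_state() is None:
    print("cluster state unavailable -- cannot confirm leadership")
```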

Summary timeline:

2019-07-17

| Timestamp (UTC) | Host | Service | Event |
|---|---|---|---|
| 07:29:37 | patroni-01 | consul | Consul agent began intermittently timing out while sending pings to other remote Consul agents, unrelated to the Patroni cluster. |
| 07:34:10 | patroni-01 | consul | Consul agent failed an HTTP call to the local Patroni agent (port 8301) for check "patroni-master" due to a socket connect timeout. |
| 07:34:16 | patroni-01 | consul | Consul agent failed an HTTP call to the local Patroni agent (port 8301) for check "patroni-master" due to a socket read timeout. Same for check "patroni-replica:1". |
| 07:34:17 | patroni-01 | patroni | Patroni local agent failed its "get_cluster" HTTP call to the Consul agent (port 8500) due to a socket read timeout (3.33 s) via loopback. |
| 07:34:23 | patroni-* | patroni | Patroni peers received a message from patroni-01 that it was still alive and master but was releasing the cluster lock: cluster_unlocked: true, role: master, xlog.location: 306798768095232. |
| 07:34:23 | patroni-* | patroni | Patroni peers each announced their xlog received_location. Freshest was patroni-03 (not patroni-04), but the cluster lock remained unclaimed, so the leader race continued. |
| 07:34:23 | patroni-01 | postgres | Postgres logged statement_timeout (15 s) failures of single-row INSERT and UPDATE statements, indicating severe resource contention for at least 15 seconds. |
| 07:34:27 | patroni-01 | postgres | Postgres received the first fast shutdown request and began aborting transactions. Logged by Postgres PID 55347. Presumably the shutdown signal came from the local Patroni agent? |
| 07:34:45 | patroni-01 | postgres | Postgres final pre-shutdown checkpoint complete. |
| 07:34:50 | patroni-01 | postgres | Postgres online backup mode canceled. |
| 07:35:00 | patroni-* | patroni | Patroni peers logged "establishing a new patroni connection to the postgres cluster", due to the master Postgres stopping the "online backup mode" WAL stream. |
| 07:35:00 | patroni-* | patroni | Patroni peers received a message from patroni-01 that it was stopping Postgres. |
| 07:35:00 | patroni-* | patroni | Patroni peers each announced their xlog received_location. Freshest was now patroni-04, but the cluster lock remained unclaimed, so the leader race continued. |
| 07:35:05 | patroni-01 | postgres | Postgres database shutdown complete. |
| 07:35:10 | patroni-* | patroni | Patroni peers received a message from patroni-01 that it had stopped Postgres and was relinquishing its role as master (now role "demoted"). |
| 07:35:10 | patroni-04 | patroni | Patroni on patroni-04 acquired the cluster lock and promoted itself to master. It had the freshest data (xlog.location) when online backup mode ended. |
| 07:35:10 | patroni-01 | patroni | Patroni local agent closed its connection to Postgres (port 5432). |
| 07:35:32 | patroni-01 | patroni | Patroni local agent logged that it demoted this instance from primary. Peers logged this announcement 22 seconds earlier; why was it not logged locally until now? |
| 07:35:32 | patroni-01 | postgres | Postgres started up in recovery mode. |
| 07:35:32 | patroni-01 | postgres | Postgres reached a consistent recovery state and began accepting read-only connections. |
| 07:35:47 | patroni-01 | postgres | Postgres received the second fast shutdown request and began aborting transactions. Logged by Postgres PID 42773. Presumably Patroni wanted to use pg_rewind to join the new primary's timeline. |
| 07:35:47 | patroni-01 | postgres | Postgres second database shutdown complete. |
| 07:35:47 | patroni-01 | patroni | Patroni local agent started running pg_rewind to get back to the point where its timeline (21) diverged from the new master's (22). |
| 07:36:00 | patroni-01 | patroni | Patroni local agent found the last common checkpoint on timeline 21 with the new master (patroni-04). |
| 07:36:24 | patroni-01 | patroni | Patroni local agent failed to rewind: statement timeout while fetching files from the new master (patroni-04). It then had to restore from a basebackup. |
| 10:53:53 | patroni-01 | postgres | Postgres database started up from a freshly copied basebackup taken from a peer on the new timeline. |
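
The two 07:35:47 entries and the 07:36:00 entry concern timeline divergence: the demoted node was still on timeline 21 while the new master had forked onto timeline 22, so pg_rewind first had to locate their last common checkpoint. When both instances are running, the timeline each node's control file reports can be confirmed with pg_control_checkpoint(); the sketch below is illustrative only, and the hostnames and credentials are assumptions.

```python
# Illustrative sketch: confirm which timeline each node's control file reports.
# Hostnames, user, and dbname are assumptions, not production values.
import psycopg2

for host in ("patroni-01", "patroni-04"):
    conn = psycopg2.connect(host=host, dbname="postgres", user="postgres")
    cur = conn.cursor()
    cur.execute("SELECT timeline_id, checkpoint_lsn FROM pg_control_checkpoint()")
    timeline_id, checkpoint_lsn = cur.fetchone()
    print(f"{host}: timeline {timeline_id}, checkpoint LSN {checkpoint_lsn}")
    cur.close()
    conn.close()
```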
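
The rewind failure at 07:36:24 was statement_timeout cancelling the queries pg_rewind runs against the new master to fetch changed files, which forced the much slower basebackup path. Two possible mitigations are sketched below under an assumed role name, data directory, and hosts (production values may differ): exempt the role used for replication/recovery from statement_timeout, as suggested in the notes above, or disable the timeout for the rewind session itself via the libpq options parameter in the --source-server connection string.

```python
# Hedged sketch of two mitigations; the role name "gitlab_replicator", the data
# directory path, and the hosts are assumptions, not the production values.
import subprocess
import psycopg2

# 1. Exempt the replication/recovery role from statement_timeout so the file-
#    and block-fetch queries pg_rewind runs on the source are never cancelled.
conn = psycopg2.connect(host="patroni-04", dbname="postgres", user="postgres")
conn.autocommit = True
cur = conn.cursor()
cur.execute("ALTER ROLE gitlab_replicator SET statement_timeout TO 0")
cur.close()
conn.close()

# 2. Or disable the timeout for just the rewind session: libpq's "options"
#    keyword passes session settings through the --source-server connstring.
subprocess.run(
    [
        "pg_rewind",
        "--target-pgdata=/var/opt/gitlab/postgresql/data",
        "--source-server="
        "host=patroni-04 user=gitlab_replicator dbname=postgres "
        "options='-c statement_timeout=0'",
        "--progress",
    ],
    check=True,
)
```

Since Patroni drives pg_rewind automatically, the role-level exemption is the more direct fix; a session-level setting passed in the connection string overrides role-level defaults, so either approach alone should be sufficient.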