RCA for brief gitlab.com outage on 2019-07-17
RCA for production#968.
Noteworthy events:
- Patroni initiated the failover due to timeouts during health checks.
- Consul and Patroni each reported intermittent timeouts in their mutual HTTP calls for a few minutes prior to the failover.
- At 07:34:17 UTC, the Patroni agent on host patroni-01 failed its `get_cluster` call to Consul, meaning it could not access the shared state of the cluster. This may have been the event that triggered the failover; it is reasonable to assume so. Our `retry_timeout` is 10 seconds, and per the Patroni docs, that timeout limit applies to both Consul and Postgres operations. (A sketch of this demotion logic follows the list below.)
- At 07:34:23 UTC, the Patroni agent on host patroni-01 proactively announced its distress by releasing Patroni's cluster lock (but not its master role). Patroni peer nodes responded by announcing how fresh their replicas were, in anticipation of electing a new leader.
- At 07:34:27 UTC (10 seconds after `get_cluster` failed), Patroni sent Postgres a fast shutdown command. This aborts open transactions and cleanly halts the database, minimizing the chance of losing recently committed transactions during the upcoming failover.
- At 07:35:10 UTC, the Patroni peers elected host patroni-04 as the new leader, since it had the freshest recovery point. This node assumed the master role and became writable; clients could then reconnect and return to service.
- After failover, host patroni-01 had difficulty rejoining the cluster as a replica.
- At 07:36:24 UTC, `pg_rewind` was aborted due to `statement_timeout`. We should consider exempting the postgres user used for replication/recovery from statement timeouts (see the sketch after this list).
- At 10:53:53 UTC, host patroni-01 returned to the cluster as a replica. It had to restore a basebackup, since `pg_rewind` failed. That time and the network and disk overhead could have been avoided if `pg_rewind` had succeeded.
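To make the 10-second gap between the failed `get_cluster` call (07:34:17) and the fast shutdown (07:34:27) concrete, here is a minimal Python sketch of the demotion behavior described above. It illustrates the documented semantics only; it is not Patroni's actual code, and the loop structure, the 1-second poll interval, and all function names other than `get_cluster` are assumptions.

```python
import subprocess
import time

RETRY_TIMEOUT = 10  # seconds; our Patroni `retry_timeout` setting

def get_cluster():
    """Stand-in for Patroni's `get_cluster` call to the local Consul agent.
    In the incident this began failing with socket read timeouts at 07:34:17."""
    raise TimeoutError("read timeout via loopback")

def demote_self():
    """Equivalent in effect to `pg_ctl stop --mode=fast`: abort open
    transactions, checkpoint, and halt cleanly (07:34:27 through 07:35:05)."""
    subprocess.run(["pg_ctl", "stop", "--mode=fast"], check=True)

def dcs_watchdog_loop():
    # A master that cannot confirm the shared cluster state within
    # `retry_timeout` must assume it may have lost the leader lock, and
    # demotes itself rather than risk two writable masters (split-brain).
    last_success = time.monotonic()
    while True:
        try:
            get_cluster()  # refresh shared state from Consul
            last_success = time.monotonic()
        except TimeoutError:
            if time.monotonic() - last_success >= RETRY_TIMEOUT:
                demote_self()
                return
        time.sleep(1)
```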
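The exemption suggested above could look like the following sketch. Note that `pg_rewind` (with `--source-server`) fetches changed files by running ordinary queries on the source node over libpq, so the source's `statement_timeout` applies to it like any other client; that is what killed the rewind at 07:36:24. The role name (`postgres`, per this RCA) and the connection details are placeholders.

```python
import psycopg2

# Hedged sketch: run once against the cluster as a superuser.
# Connection details below are placeholders.
conn = psycopg2.connect("host=patroni-04 dbname=postgres user=postgres")
conn.autocommit = True  # ALTER ROLE should not sit in an open transaction
with conn.cursor() as cur:
    # 0 disables statement_timeout for every new session of this role,
    # so recovery tooling such as pg_rewind cannot be killed mid-query.
    cur.execute("ALTER ROLE postgres SET statement_timeout = 0;")
conn.close()
```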
Summary timeline:
2019-07-17
| Timestamp (UTC) | Host | Service | Event |
|---|---|---|---|
| 07:29:37 UTC | patroni-01 | consul | Consul agent begins intermittently timing out while sending pings to other remote Consul agents, unrelated to the Patroni cluster. |
| 07:34:10 UTC | patroni-01 | consul | Consul agent failed HTTP call to local Patroni agent (port 8301) for check "patroni-master" due to socket connect timeout. |
| 07:34:16 UTC | patroni-01 | consul | Consul agent failed HTTP call to local Patroni agent (port 8301) for check "patroni-master" due to socket read timeout. Same for check "patroni-replica:1". |
| 07:34:17 UTC | patroni-01 | patroni | Patroni local agent failed "get_cluster" HTTP call to Consul agent (port 8500) due to socket read timeout (3.33s) via loopback. |
| 07:34:23 UTC | patroni-* | patroni | Patroni peers get message from patroni-01 that it is still alive and master but is releasing the cluster lock: `cluster_unlocked: true, role: master, xlog.location: 306798768095232` |
| 07:34:23 UTC | patroni-* | patroni | Patroni peers each announce their xlog received_location. Freshest is now patroni-03 (not patroni-04). But cluster lock remains unclaimed, so leader race continues. |
| 07:34:23 UTC | patroni-01 | postgres | Postgres logs statement_timeout (15s) failures of single-row INSERT and UPDATE statements, indicating severe resource contention for at least 15 seconds. |
| 07:34:27 UTC | patroni-01 | postgres | Postgres receives 1st fast shutdown request and begins aborting transactions. Logged by postgres PID 55347. Presumably the shutdown signal came from the local Patroni agent? |
| 07:34:45 UTC | patroni-01 | postgres | Postgres final pre-shutdown checkpoint complete. |
| 07:34:50 UTC | patroni-01 | postgres | Postgres online backup mode canceled. |
| 07:35:00 UTC | patroni-* | patroni | Patroni peers log: "establishing a new patroni connection to the postgres cluster", due to the master postgres stopping its "online backup mode" WAL stream. |
| 07:35:00 UTC | patroni-* | patroni | Patroni peers get message from patroni-01 that it is stopping postgres. |
| 07:35:00 UTC | patroni-* | patroni | Patroni peers each announce their xlog received_location. Freshest is now patroni-04. But cluster lock remains unclaimed, so leader race continues. |
| 07:35:05 UTC | patroni-01 | postgres | Postgres database shutdown complete. |
| 07:35:10 UTC | patroni-* | patroni | Patroni peers get message from patroni-01 that it has stopped postgres and is relinquishing its role as master (now role "demoted"). |
| 07:35:10 UTC | patroni-04 | patroni | Patroni on patroni-04 acquires cluster lock and promotes self to role master. It had the freshest data (xlog.location) when online backup mode ended. |
| 07:35:10 UTC | patroni-01 | patroni | Patroni local agent closed its connection to postgres (port 5432). |
| 07:35:32 UTC | patroni-01 | patroni | Patroni local agent logs that it demoted this instance from primary. Peers logged this announcement 22 seconds earlier (07:35:10). Why was it not logged locally until now? |
| 07:35:32 UTC | patroni-01 | postgres | Postgres startup in recovery mode. |
| 07:35:32 UTC | patroni-01 | postgres | Postgres reached consistent recovery state and begins accepting read-only connections. |
| 07:35:47 UTC | patroni-01 | postgres | Postgres receives 2nd fast shutdown request and begins aborting transactions. Logged by postgres PID 42773. Presumably Patroni wants to use pg_rewind to join the new primary's timeline. |
| 07:35:47 UTC | patroni-01 | postgres | Postgres 2nd database shutdown complete. |
| 07:35:47 UTC | patroni-01 | patroni | Patroni local agent starts running pg_rewind to rewind to the point where its timeline (21) diverged from the new master's (22). |
| 07:36:00 UTC | patroni-01 | patroni | Patroni local agent found last common checkpoint on timeline 21 with new master (patroni-04). |
| 07:36:24 UTC | patroni-01 | patroni | Patroni local agent failed to rewind! Statement timeout while fetching files from new master (patroni-04). Must now restore from basebackup. |
| 10:53:53 UTC | patroni-01 | postgres | Postgres starts up from a freshly copied basebackup, taken from a peer on the new timeline (see the sketches below). |
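For reference, the rewind step that failed at 07:36:24 corresponds to an invocation like the following sketch (the data directory and connection string are assumptions based on a typical Omnibus layout). `pg_rewind` copies only what changed since the last common checkpoint on timeline 21, which is why it is so much cheaper than a full basebackup.

```python
import subprocess

# Illustrative sketch (paths and connection string are assumptions):
# rewind the old master's data directory back to the point where its
# timeline (21) diverged from the new master's (22), then rejoin as a replica.
subprocess.run(
    [
        "pg_rewind",
        "--target-pgdata=/var/opt/gitlab/postgresql/data",
        "--source-server=host=patroni-04 user=postgres dbname=postgres",
        "--progress",
    ],
    check=True,
)
```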
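When the rewind failed, Patroni fell back to re-seeding the replica with a full copy of the new master's data, which is what kept patroni-01 out of the cluster until 10:53:53. A minimal sketch of that fallback, under the same assumed paths (`--wal-method=stream` assumes PostgreSQL 10 or newer):

```python
import subprocess

# Illustrative sketch of the basebackup fallback: copy the entire data
# directory from the new primary instead of just the divergent blocks.
subprocess.run(
    [
        "pg_basebackup",
        "--pgdata=/var/opt/gitlab/postgresql/data",  # assumed data directory
        "--host=patroni-04",                         # the new primary
        "--username=postgres",                       # assumed replication role
        "--wal-method=stream",                       # stream WAL during the copy
        "--checkpoint=fast",
        "--progress",
    ],
    check=True,
)
```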