Skip to content

Enable Patroni Maintenance Mode to prevent Postgres restarts during stress tests

Vitaliy Kukharik requested to merge patroni-pause into master

Enable maintenance mode (Disable auto failover) for the Patroni cluster, to avoid restarting Postgres during a stress testing.

Fixed: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/jobs/6188692704

FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 181 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 152 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 8 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 163 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 45 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 9 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 151 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 142 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 120 script 0 aborted in command 1 query 0: FATAL:  terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
...
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: Run was aborted; the above results are incomplete.
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 ERROR: Failed to execute workload_pgbench command.
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 INFO: Stop sar observation...
WARNING: Connection terminated, reconnection attempt... (3 from 60).
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:35 INFO: Stop pidstat observation...
ERROR: Test stopped with error, exit code: 1.

Details

During stress testing, when all CPU cores are saturated, there may be a problem of communication with 'DCS' (etcd) for a long time, due to which Patroni cannot update the leader key to maintain the cluster in a healthy state (confirmation of the leader role), in connection with which a situation is possible in which Patroni restarts Postgres, to downgrade the Primary role to a replica.

Example:

Feb 16 19:01:15 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:15,694 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:01:22 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:19,519 INFO: Lock owner: vitaliy-test-pgnode01; I am vitaliy-test-pgnode01
Feb 16 19:01:22 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:22,858 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=3.333209136333229)")
Feb 16 19:01:22 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:22,858 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:25,370 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=2.0765518459712666)")
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:25,370 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:26,890 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=1.3159597042231326)")
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:26,890 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:29 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,524 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=1.6383134893990245)")
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,524 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,526 ERROR: Error communicating with DCS
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,526 ERROR: watchprefix failed: ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,527 INFO: demoting self because DCS is not accessible and I was a leader
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,527 INFO: Demoting self (offline)
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,528 WARNING: Loop time exceeded, rescheduling immediately.
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"warn","ts":"2024-02-16T19:01:32.101466Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.485612931s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/service/vitaliy-test/\" range_end:\"/service/vitaliy-test0\" serializable:true ","response":"range_response_count:6 size:3969"}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"info","ts":"2024-02-16T19:01:32.101509Z","caller":"traceutil/trace.go:171","msg":"trace[1226141278] range","detail":"{range_begin:/service/vitaliy-test/; range_end:/service/vitaliy-test0; response_count:6; response_revision:136; }","duration":"2.485692822s","start":"2024-02-16T19:01:29.615807Z","end":"2024-02-16T19:01:32.1015Z","steps":["trace[1226141278] 'range keys from in-memory index tree'  (duration: 2.485582487s)"],"step_count":1}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"warn","ts":"2024-02-16T19:01:32.101533Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-02-16T19:01:29.6158Z","time spent":"2.485725537s","remote":"10.128.0.34:22884","response type":"/etcdserverpb.KV/Range","request count":0,"request size":50,"response count":6,"response size":3992,"request content":"key:\"/service/vitaliy-test/\" range_end:\"/service/vitaliy-test0\" serializable:true "}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"info","ts":"2024-02-16T19:01:32.511641Z","caller":"traceutil/trace.go:171","msg":"trace[2033421025] put","detail":"{key:/service/vitaliy-test/members/vitaliy-test-pgnode01; req_size:247; response_revision:137; }","duration":"363.791535ms","start":"2024-02-16T19:01:32.147834Z","end":"2024-02-16T19:01:32.511626Z","steps":["trace[2033421025] 'process raft request'  (duration: 363.715377ms)"],"step_count":1}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"warn","ts":"2024-02-16T19:01:32.51172Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-02-16T19:01:32.147824Z","time spent":"363.87722ms","remote":"10.128.0.34:22884","response type":"/etcdserverpb.KV/Put","request count":1,"request size":247,"response count":0,"response size":28,"request content":"key:\"/service/vitaliy-test/members/vitaliy-test-pgnode01\" value_size:181 lease:8478745049474829604 "}
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:32,102 INFO: Lock owner: vitaliy-test-pgnode01; I am vitaliy-test-pgnode01
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:32,512 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
Feb 16 19:01:33 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:33,384 INFO: postmaster pid=12159
Feb 16 19:01:33 vitaliy-test-pgnode01 patroni[12160]: /var/run/postgresql:5432 - no response
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12159]: 2024-02-16 19:01:34 UTC [12159-1]  LOG:  redirecting log output to logging collector process
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12159]: 2024-02-16 19:01:34 UTC [12159-2]  HINT:  Future log output will appear in directory "/var/log/postgresql".
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12172]: /var/run/postgresql:5432 - accepting connections
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12174]: /var/run/postgresql:5432 - accepting connections
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,527 INFO: Lock owner: vitaliy-test-pgnode01; I am vitaliy-test-pgnode01
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,528 INFO: establishing a new patroni heartbeat connection to postgres
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,578 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,630 INFO: promoted self to leader because I had the session lock
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[12302]: server promoting
Feb 16 19:01:40 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:40,817 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:01:50 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:50,769 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:02:00 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:02:00,680 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:02:10 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:02:10,680 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock

This change enables maintenance mode for Patroni cluster during the test to ignore a similar problem and avoid restarting Postgres.
This is necessary to improve the reliability of the dedicated tests in order to reduce the number of failed pipelines.

Edited by Vitaliy Kukharik

Merge request reports