Enable Patroni Maintenance Mode to prevent Postgres restarts during stress tests
Enable maintenance mode (Disable auto failover) for the Patroni cluster, to avoid restarting Postgres during a stress testing.
Fixed: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/jobs/6188692704
FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 181 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 152 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 8 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 163 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 45 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 9 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 151 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 142 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: client 120 script 0 aborted in command 1 query 0: FATAL: terminating connection due to administrator command
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 SSL connection has been closed unexpectedly
...
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 pgbench: error: Run was aborted; the above results are incomplete.
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 ERROR: Failed to execute workload_pgbench command.
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:33 INFO: Stop sar observation...
WARNING: Connection terminated, reconnection attempt... (3 from 60).
[with_pgss_200_clients|workload_pgbench] 2024-02-16 10:46:35 INFO: Stop pidstat observation...
ERROR: Test stopped with error, exit code: 1.
Details
During stress testing, when all CPU cores are saturated, there may be a problem of communication with 'DCS' (etcd) for a long time, due to which Patroni cannot update the leader key to maintain the cluster in a healthy state (confirmation of the leader role), in connection with which a situation is possible in which Patroni restarts Postgres, to downgrade the Primary role to a replica.
Example:
Feb 16 19:01:15 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:15,694 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:01:22 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:19,519 INFO: Lock owner: vitaliy-test-pgnode01; I am vitaliy-test-pgnode01
Feb 16 19:01:22 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:22,858 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=3.333209136333229)")
Feb 16 19:01:22 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:22,858 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:25,370 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=2.0765518459712666)")
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:25,370 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:26,890 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=1.3159597042231326)")
Feb 16 19:01:26 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:26,890 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:29 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,524 ERROR: Request to server http://10.128.0.34:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='10.128.0.34', port=2379): Read timed out. (read timeout=1.6383134893990245)")
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,524 INFO: Reconnection allowed, looking for another server.
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,526 ERROR: Error communicating with DCS
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,526 ERROR: watchprefix failed: ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,527 INFO: demoting self because DCS is not accessible and I was a leader
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,527 INFO: Demoting self (offline)
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:29,528 WARNING: Loop time exceeded, rescheduling immediately.
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"warn","ts":"2024-02-16T19:01:32.101466Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.485612931s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/service/vitaliy-test/\" range_end:\"/service/vitaliy-test0\" serializable:true ","response":"range_response_count:6 size:3969"}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"info","ts":"2024-02-16T19:01:32.101509Z","caller":"traceutil/trace.go:171","msg":"trace[1226141278] range","detail":"{range_begin:/service/vitaliy-test/; range_end:/service/vitaliy-test0; response_count:6; response_revision:136; }","duration":"2.485692822s","start":"2024-02-16T19:01:29.615807Z","end":"2024-02-16T19:01:32.1015Z","steps":["trace[1226141278] 'range keys from in-memory index tree' (duration: 2.485582487s)"],"step_count":1}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"warn","ts":"2024-02-16T19:01:32.101533Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-02-16T19:01:29.6158Z","time spent":"2.485725537s","remote":"10.128.0.34:22884","response type":"/etcdserverpb.KV/Range","request count":0,"request size":50,"response count":6,"response size":3992,"request content":"key:\"/service/vitaliy-test/\" range_end:\"/service/vitaliy-test0\" serializable:true "}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"info","ts":"2024-02-16T19:01:32.511641Z","caller":"traceutil/trace.go:171","msg":"trace[2033421025] put","detail":"{key:/service/vitaliy-test/members/vitaliy-test-pgnode01; req_size:247; response_revision:137; }","duration":"363.791535ms","start":"2024-02-16T19:01:32.147834Z","end":"2024-02-16T19:01:32.511626Z","steps":["trace[2033421025] 'process raft request' (duration: 363.715377ms)"],"step_count":1}
Feb 16 19:01:32 vitaliy-test-pgnode01 bash[1389]: {"level":"warn","ts":"2024-02-16T19:01:32.51172Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-02-16T19:01:32.147824Z","time spent":"363.87722ms","remote":"10.128.0.34:22884","response type":"/etcdserverpb.KV/Put","request count":1,"request size":247,"response count":0,"response size":28,"request content":"key:\"/service/vitaliy-test/members/vitaliy-test-pgnode01\" value_size:181 lease:8478745049474829604 "}
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:32,102 INFO: Lock owner: vitaliy-test-pgnode01; I am vitaliy-test-pgnode01
Feb 16 19:01:32 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:32,512 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
Feb 16 19:01:33 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:33,384 INFO: postmaster pid=12159
Feb 16 19:01:33 vitaliy-test-pgnode01 patroni[12160]: /var/run/postgresql:5432 - no response
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12159]: 2024-02-16 19:01:34 UTC [12159-1] LOG: redirecting log output to logging collector process
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12159]: 2024-02-16 19:01:34 UTC [12159-2] HINT: Future log output will appear in directory "/var/log/postgresql".
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12172]: /var/run/postgresql:5432 - accepting connections
Feb 16 19:01:34 vitaliy-test-pgnode01 patroni[12174]: /var/run/postgresql:5432 - accepting connections
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,527 INFO: Lock owner: vitaliy-test-pgnode01; I am vitaliy-test-pgnode01
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,528 INFO: establishing a new patroni heartbeat connection to postgres
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,578 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:39,630 INFO: promoted self to leader because I had the session lock
Feb 16 19:01:39 vitaliy-test-pgnode01 patroni[12302]: server promoting
Feb 16 19:01:40 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:40,817 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:01:50 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:01:50,769 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:02:00 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:02:00,680 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
Feb 16 19:02:10 vitaliy-test-pgnode01 patroni[11326]: 2024-02-16 19:02:10,680 INFO: no action. I am (vitaliy-test-pgnode01), the leader with the lock
This change enables maintenance mode for Patroni cluster during the test to ignore a similar problem and avoid restarting Postgres.
This is necessary to improve the reliability of the dedicated tests in order to reduce the number of failed pipelines.