Cluster crash-loop if the Kubernetes API is temporarily unavailable
Summary
If the Kubernetes API is temporarily unavailable, the cluster crashes and cannot heal itself.
The only way to get the cluster running again is to manually kill the primary pod (see the command below).
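For reference, killing the pod is a plain pod deletion; the StatefulSet then recreates it and the fresh container comes up cleanly (namespace and pod name taken from the logs below):

$ kubectl -n development delete pod app-pgsql-2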
Current Behaviour
Whenever our Kubernetes API is down for a short period of time (seconds),
the primary pod crashes and remains in a crash-recovery loop,
even after the Kubernetes API becomes available again.
The cluster stays down until someone restarts the pod.
Steps to reproduce
Make the requests to the Kubernetes API time out.
How exactly to reproduce this depends a bit on the cloud environment and networking; one way to simulate it is sketched below.
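One way to simulate the outage (a sketch that assumes pod traffic traverses the node's iptables; details vary with the CNI and kube-proxy mode): silently drop packets to the API Service IP on the node running the primary pod, so requests time out instead of being rejected.

$ # drop API traffic in the raw table, before kube-proxy's NAT rewrites the Service IP
$ iptables -t raw -I PREROUTING -d 10.240.16.1 -p tcp --dport 443 -j DROP
$ sleep 30    # keep the API unreachable for a short period
$ iptables -t raw -D PREROUTING -d 10.240.16.1 -p tcp --dport 443 -j DROP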
Expected Behaviour
When the Kubernetes API is back, the cluster should be able to heal itself.
Possible Solution
Auto-kill the pod via a liveness probe.
A pod in this condition is definitely not live anymore, as the postmortem logs show (WARNING: Postgresql is not running.).
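A minimal sketch of what such a probe could look like (hypothetical values, and the pod template is generated by StackGres, so the operator would have to wire this in). It runs pg_isready against the local socket, essentially the same check that produces the "/var/run/postgresql:5432 - no response" lines below, with a generous failureThreshold so a normal restart or failover does not trip it:

livenessProbe:
  exec:
    # pg_isready checks whether the local postmaster accepts connections
    command: ["pg_isready", "-h", "/var/run/postgresql", "-p", "5432"]
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 12   # ~2 minutes of no response before the kubelet restarts the pod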
Environment
- StackGres version: v1.2.1
- Kubernetes version: v1.22.5
- Cloud provider or hardware configuration: a managed Kubernetes hosted on OpenStack.
To be sure, the IP 10.240.16.1 is the ClusterIP of the kubernetes Service fronting the Kubernetes API:
$ kubectl get service kubernetes -o wide
NAME         TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE    SELECTOR
kubernetes   NodePort   10.240.16.1   <none>        443:31902/TCP   487d   <none>
Postmortem logs
Here are the relevant parts of our logs (postmortem).
In this example, the cluster name is app-pgsql, the user is appUser, and the database is app-db.
We run with the following three nodes:
- app-pgsql-0
- app-pgsql-1
- app-pgsql-2 (primary)
At the time of the crash, app-pgsql-2 was the primary.
The path to the crash begins with a read timeout in the primary pod.
app-pgsql-2 (primary)
ERROR: Request to server https://10.240.16.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.240.16.1', port=443): Read timed out. (read timeout=4.9997613350860775)",)
[203] LOG C-0xcfc0d0: app-db/appUser@127.0.0.1:44462 login attempt: db=app-db user=appUser tls=no
[203] LOG C-0xcfc760: app-db/appUser@127.0.0.1:44460 login attempt: db=app-db user=appUser tls=no
[203] LOG C-0xcfc530: app-db/appUser@127.0.0.1:44464 login attempt: db=app-db user=appUser tls=no
[203] LOG C-0xcfc530: app-db/appUser@127.0.0.1:44464 closing because: client close request (age=0s)
[203] LOG C-0xcfc760: app-db/appUser@127.0.0.1:44460 closing because: client close request (age=0s)
[203] LOG C-0xcfc0d0: app-db/appUser@127.0.0.1:44462 closing because: client close request (age=0s)
WARNING: Concurrent update of app-pgsql
INFO: no action. I am (app-pgsql-2), the leader with the lock
After a few seconds and more failed requests, the primary demotes itself, as expected during a network partition.
ERROR [io.st.cl.co.ClusterControllerReconciliationCycle] (Cluster-ReconciliationCycle) 83517| Cluster reconciliation cycle failed sending event while retrieving reconciliation cycle contexts: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.240.16.1/api/v1/namespaces/development/events. Message: rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58466->10.244.109.41:2379: read: connection timed out. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58466->10.244.109.41:2379: read: connection timed out, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.240.16.1', port=443): Read timed out. (read timeout=4.999653317965567)",)': /api/v1/namespaces/development/endpoints/app-pgsql
ERROR: failed to update leader lock
INFO: Demoting self (immediate-nolock)
Here is the full demotion sequence from the primary pod:
INFO: Demoting self (immediate-nolock)
UTC [231]: db=,user=,app=,client= LOG: received immediate shutdown request
[203] WARNING S-0xd09560: app-db/appUser@unix:5432 got packet 'N' from server when not linked
[203] WARNING S-0xd09790: app-db/appUser@unix:5432 got packet 'N' from server when not linked
[203] LOG S-0xd09560: app-db/appUser@unix:5432 closing because: server conn crashed? (age=412s)
[203] LOG S-0xd08ed0: app-db/appUser@unix:5432 closing because: server conn crashed? (age=255s)
[203] LOG C-0xcfb5e0: app-db/appUser@127.0.0.1:33194 closing because: server conn crashed? (age=255s)
[203] WARNING C-0xcfb5e0: app-db/appUser@127.0.0.1:33194 pooler error: server conn crashed?
[203] WARNING S-0xd09330: app-db/authenticator@unix:5432 got packet 'N' from server when not linked
[203] WARNING S-0xd0a4b0: app-db/authenticator@unix:5432 got packet 'N' from server when not linked
[203] LOG S-0xd09330: app-db/authenticator@unix:5432 closing because: server conn crashed? (age=3345s)
[203] LOG S-0xd0a4b0: app-db/authenticator@unix:5432 closing because: server conn crashed? (age=2165s)
[203] LOG S-0xd09790: app-db/appUser@unix:5432 closing because: server conn crashed? (age=2766s)
[203] LOG S-0xd0a280: app-db/appUser@unix:5432 closing because: server conn crashed? (age=2676s)
[203] LOG C-0xcfb810: app-db/appUser@127.0.0.1:45186 closing because: server conn crashed? (age=145s)
[203] WARNING C-0xcfb810: app-db/appUser@127.0.0.1:45186 pooler error: server conn crashed?
[231]: db=,user=,app=,client= LOG: database system is shut down
[203] WARNING sbuf_connect failed: No such file or directory
[203] LOG S-0xd0a280: app-db/authenticator@unix:5432 closing because: connect failed (age=0s)
[203] LOG C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44198 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44198 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44200 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44200 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44338 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44338 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfc530: app-db/(nouser)@127.0.0.1:44336 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfc530: app-db/(nouser)@127.0.0.1:44336 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfc530: app-db/(nouser)@127.0.0.1:44340 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfc530: app-db/(nouser)@127.0.0.1:44340 pooler error: pgbouncer cannot connect to server
INFO: demoted self because failed to update leader lock in DCS
WARNING: Loop time exceeded, rescheduling immediately.
INFO: closed patroni connection to the postgresql cluster
INFO: Lock owner: app-pgsql-2; I am app-pgsql-2
INFO: updated leader lock during starting after demotion
INFO: postmaster pid=1177595
[1177595]: db=,user=,app=,client= FATAL: data directory "/var/lib/postgresql/data" has invalid permissions
[1177595]: db=,user=,app=,client= DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
/var/run/postgresql:5432 - no response
[203] LOG C-0xcfc530: app-db/(nouser)@127.0.0.1:44360 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfc530: app-db/(nouser)@127.0.0.1:44360 pooler error: pgbouncer cannot connect to server
ERROR: postmaster is not running
INFO [io.st.co.co.PersistentVolumeSizeReconciliator] (Cluster-ReconciliationCycle) Reconciling persistent volume claim sizes
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Cluster bootstrap completed
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Setting cluster arch x86_64 and os linux
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconcile postgres extensions...
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconciliation of postgres extensions completed
ERROR [io.st.cl.co.PgBouncerReconciliator] (Cluster-ReconciliationCycle) An error occurred while updating pgbouncer auth_file: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
...
Caused by: java.net.ConnectException: Connection refused
After that, we get an infinite loop with the following messages:
/var/run/postgresql:5432 - no response
WARNING: Postgresql is not running.
INFO: Lock owner: app-pgsql-2; I am app-pgsql-2
INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7059687927088955601
Database cluster state: in production
pg_control last modified: Mon Oct 10 21:30:26 2022
Latest checkpoint location: 2/C815AD20
Latest checkpoint's REDO location: 2/C815ACB8
Latest checkpoint's REDO WAL file: 000000CB00000002000000C8
Latest checkpoint's TimeLineID: 203
Latest checkpoint's PrevTimeLineID: 203
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:1034458
Latest checkpoint's NextOID: 90416
Latest checkpoint's NextMultiXactId: 2720
Latest checkpoint's NextMultiOffset: 5439
Latest checkpoint's oldestXID: 727
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 1034458
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 504930
Latest checkpoint's newestCommitTsXid: 1034457
Time of latest checkpoint: Mon Oct 10 21:30:22 2022
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 20
max_prepared_xacts setting: 32
max_locks_per_xact setting: 128
track_commit_timestamp setting: on
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 1
Mock authentication nonce: 3862e35396b451e79a4fd0cf1af11a55f6c4b485aea57187864790b6e50db529
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=1
INFO: stdout=
INFO: stderr=2022-10-10 21:39:03 UTC [1177659]: db=,user=,app=,client= FATAL: data directory "/var/lib/postgresql/data" has invalid permissions
[1177659]: db=,user=,app=,client= DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Cluster bootstrap completed
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Setting cluster arch x86_64 and os linux
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconcile postgres extensions...
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconciliation of postgres extensions completed
ERROR [io.st.cl.co.PgBouncerReconciliator] (Cluster-ReconciliationCycle) An error occurred while updating pgbouncer auth_file: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
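A side note for whoever picks this up: the loop above also keeps hitting the FATAL about /var/lib/postgresql/data having invalid permissions, so the single-user crash recovery may be failing on permissions rather than on the API outage itself. A quick diagnostic from inside the pod (assuming the database container is named patroni; adjust -c to the actual container name):

$ kubectl -n development exec app-pgsql-2 -c patroni -- ls -ld /var/lib/postgresql/data
$ # PostgreSQL requires u=rwx (0700) or u=rwx,g=rx (0750) on the data directory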
In the meantime, the other pods have also reported similar problems with the API.
app-pgsql-0 (replica)
ERROR [io.st.cl.co.ClusterControllerReconciliationCycle] (Cluster-ReconciliationCycle) 83535| Cluster reconciliation cycle failed reconciling development.app-pgsql: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.240.16.1/api/v1/namespaces/development/configmaps?labelSelector=stackgres.io%2Fcluster-name%3Dapp-pgsql%2Capp%3DStackGresCluster. Message: rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58512->10.244.109.41:2379: read: connection timed out. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58512->10.244.109.41:2379: read: connection timed out, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
app-pgsql-1 (replica)
ERROR [io.st.cl.co.PgBouncerReconciliator] (Cluster-ReconciliationCycle) An error occurred while updating pgbouncer auth_file: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [SGPoolingConfig] with name: [generated-from-default-1659596625556] in namespace: [development] failed.
...
Caused by: java.net.SocketTimeoutException: timeout
Thank you and best regards,
Ricardo