Cluster crash-loop if the Kubernetes API is temporarily unavailable
Summary
If the Kubernetes API is temporarily unavailable, the cluster crashes and cannot heal itself.
The only way to get the cluster running again is to manually kill the primary pod (see the command below).
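For reference, killing the pod is a plain pod deletion; the StatefulSet then recreates it and the fresh container comes up cleanly (namespace and pod name taken from the logs below):

$ kubectl -n development delete pod app-pgsql-2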
Current Behaviour
Whenever our Kubernetes API is down for a short period of time (seconds),
the primary pod crashes and remains in a crash-recovery loop,
even after the Kubernetes API becomes available again.
The cluster stays down until someone restarts the pod.
Steps to reproduce
Make the requests to the Kubernetes API time out.
How exactly to reproduce this depends a bit on the cloud environment and networking; one way to simulate it is sketched below.
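One way to simulate the outage (a sketch that assumes pod traffic traverses the node's iptables; details vary with the CNI and kube-proxy mode): silently drop packets to the API Service IP on the node running the primary pod, so requests time out instead of being rejected.

$ # drop API traffic in the raw table, before kube-proxy's NAT rewrites the Service IP
$ iptables -t raw -I PREROUTING -d 10.240.16.1 -p tcp --dport 443 -j DROP
$ sleep 30    # keep the API unreachable for a short period
$ iptables -t raw -D PREROUTING -d 10.240.16.1 -p tcp --dport 443 -j DROP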
Expected Behaviour
When the Kubernetes API is back, the cluster should be able to heal itself.
Possible Solution
Auto-kill the pod via a liveness probe.
A pod in this condition is definitely not live anymore, as the postmortem logs show (WARNING: Postgresql is not running.).
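A minimal sketch of what such a probe could look like (hypothetical values, and the pod template is generated by StackGres, so the operator would have to wire this in). It runs pg_isready against the local socket, essentially the same check that produces the "/var/run/postgresql:5432 - no response" lines below, with a generous failureThreshold so a normal restart or failover does not trip it:

livenessProbe:
  exec:
    # pg_isready checks whether the local postmaster accepts connections
    command: ["pg_isready", "-h", "/var/run/postgresql", "-p", "5432"]
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 12   # ~2 minutes of no response before the kubelet restarts the pod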
Environment
- StackGres version: v1.2.1
- Kubernetes version: v1.22.5
- Cloud provider or hardware configuration: a managed Kubernetes hosted on OpenStack.
To be sure, the IP 10.240.16.1 is the ClusterIP of the kubernetes Service fronting the Kubernetes API:
$ kubectl get service kubernetes -o wide
NAME         TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE    SELECTOR
kubernetes   NodePort   10.240.16.1   <none>        443:31902/TCP   487d   <none>
Postmortem logs
Here are the relevant parts of our logs (postmortem).
In this example, the cluster name is app-pgsql, the user is appUser, and the database is app-db.
We run with the following three nodes:
- app-pgsql-0
- app-pgsql-1
- app-pgsql-2 (primary)
At the time of the crash, app-pgsql-2 was the primary.
The path to the crash begins with a read timeout in the primary pod.
app-pgsql-2 (primary)
ERROR: Request to server https://10.240.16.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.240.16.1', port=443): Read timed out. (read timeout=4.9997613350860775)",)
[203] LOG C-0xcfc0d0: app-db/appUser@127.0.0.1:44462 login attempt: db=app-db user=appUser tls=no
[203] LOG C-0xcfc760: app-db/appUser@127.0.0.1:44460 login attempt: db=app-db user=appUser tls=no
[203] LOG C-0xcfc530: app-db/appUser@127.0.0.1:44464 login attempt: db=app-db user=appUser tls=no
[203] LOG C-0xcfc530: app-db/appUser@127.0.0.1:44464 closing because: client close request (age=0s)
[203] LOG C-0xcfc760: app-db/appUser@127.0.0.1:44460 closing because: client close request (age=0s)
[203] LOG C-0xcfc0d0: app-db/appUser@127.0.0.1:44462 closing because: client close request (age=0s)
WARNING: Concurrent update of app-pgsql
INFO: no action. I am (app-pgsql-2), the leader with the lock
After a few seconds and more failed requests, the primary demotes itself, as expected during a network partition.
ERROR [io.st.cl.co.ClusterControllerReconciliationCycle] (Cluster-ReconciliationCycle) 83517| Cluster reconciliation cycle failed sending event while retrieving reconciliation cycle contexts: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.240.16.1/api/v1/namespaces/development/events. Message: rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58466->10.244.109.41:2379: read: connection timed out. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58466->10.244.109.41:2379: read: connection timed out, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.240.16.1', port=443): Read timed out. (read timeout=4.999653317965567)",)': /api/v1/namespaces/development/endpoints/app-pgsql
ERROR: failed to update leader lock
INFO: Demoting self (immediate-nolock)
Here is the full demotion sequence from the primary pod:
INFO: Demoting self (immediate-nolock)
UTC [231]: db=,user=,app=,client= LOG: received immediate shutdown request
[203] WARNING S-0xd09560: app-db/appUser@unix:5432 got packet 'N' from server when not linked
[203] WARNING S-0xd09790: app-db/appUser@unix:5432 got packet 'N' from server when not linked
[203] LOG S-0xd09560: app-db/appUser@unix:5432 closing because: server conn crashed? (age=412s)
[203] LOG S-0xd08ed0: app-db/appUser@unix:5432 closing because: server conn crashed? (age=255s)
[203] LOG C-0xcfb5e0: app-db/appUser@127.0.0.1:33194 closing because: server conn crashed? (age=255s)
[203] WARNING C-0xcfb5e0: app-db/appUser@127.0.0.1:33194 pooler error: server conn crashed?
[203] WARNING S-0xd09330: app-db/authenticator@unix:5432 got packet 'N' from server when not linked
[203] WARNING S-0xd0a4b0: app-db/authenticator@unix:5432 got packet 'N' from server when not linked
[203] LOG S-0xd09330: app-db/authenticator@unix:5432 closing because: server conn crashed? (age=3345s)
[203] LOG S-0xd0a4b0: app-db/authenticator@unix:5432 closing because: server conn crashed? (age=2165s)
[203] LOG S-0xd09790: app-db/appUser@unix:5432 closing because: server conn crashed? (age=2766s)
[203] LOG S-0xd0a280: app-db/appUser@unix:5432 closing because: server conn crashed? (age=2676s)
[203] LOG C-0xcfb810: app-db/appUser@127.0.0.1:45186 closing because: server conn crashed? (age=145s)
[203] WARNING C-0xcfb810: app-db/appUser@127.0.0.1:45186 pooler error: server conn crashed?
[231]: db=,user=,app=,client= LOG: database system is shut down
[203] WARNING sbuf_connect failed: No such file or directory
[203] LOG S-0xd0a280: app-db/authenticator@unix:5432 closing because: connect failed (age=0s)
[203] LOG C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44198 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44198 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44200 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44200 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44338 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfb5e0: app-db/(nouser)@127.0.0.1:44338 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfc530: app-db/(nouser)@127.0.0.1:44336 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfc530: app-db/(nouser)@127.0.0.1:44336 pooler error: pgbouncer cannot connect to server
[203] LOG C-0xcfc530: app-db/(nouser)@127.0.0.1:44340 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfc530: app-db/(nouser)@127.0.0.1:44340 pooler error: pgbouncer cannot connect to server
INFO: demoted self because failed to update leader lock in DCS
WARNING: Loop time exceeded, rescheduling immediately.
INFO: closed patroni connection to the postgresql cluster
INFO: Lock owner: app-pgsql-2; I am app-pgsql-2
INFO: updated leader lock during starting after demotion
INFO: postmaster pid=1177595
[1177595]: db=,user=,app=,client= FATAL: data directory "/var/lib/postgresql/data" has invalid permissions
[1177595]: db=,user=,app=,client= DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
/var/run/postgresql:5432 - no response
[203] LOG C-0xcfc530: app-db/(nouser)@127.0.0.1:44360 closing because: pgbouncer cannot connect to server (age=0s)
[203] WARNING C-0xcfc530: app-db/(nouser)@127.0.0.1:44360 pooler error: pgbouncer cannot connect to server
ERROR: postmaster is not running
INFO [io.st.co.co.PersistentVolumeSizeReconciliator] (Cluster-ReconciliationCycle) Reconciling persistent volume claim sizes
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Cluster bootstrap completed
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Setting cluster arch x86_64 and os linux
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconcile postgres extensions...
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconciliation of postgres extensions completed
ERROR [io.st.cl.co.PgBouncerReconciliator] (Cluster-ReconciliationCycle) An error occurred while updating pgbouncer auth_file: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
...
Caused by: java.net.ConnectException: Connection refused
After that, we get an infinite loop with the following messages:
/var/run/postgresql:5432 - no response
WARNING: Postgresql is not running.
INFO: Lock owner: app-pgsql-2; I am app-pgsql-2
INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7059687927088955601
Database cluster state: in production
pg_control last modified: Mon Oct 10 21:30:26 2022
Latest checkpoint location: 2/C815AD20
Latest checkpoint's REDO location: 2/C815ACB8
Latest checkpoint's REDO WAL file: 000000CB00000002000000C8
Latest checkpoint's TimeLineID: 203
Latest checkpoint's PrevTimeLineID: 203
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:1034458
Latest checkpoint's NextOID: 90416
Latest checkpoint's NextMultiXactId: 2720
Latest checkpoint's NextMultiOffset: 5439
Latest checkpoint's oldestXID: 727
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 1034458
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 504930
Latest checkpoint's newestCommitTsXid: 1034457
Time of latest checkpoint: Mon Oct 10 21:30:22 2022
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 20
max_prepared_xacts setting: 32
max_locks_per_xact setting: 128
track_commit_timestamp setting: on
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 1
Mock authentication nonce: 3862e35396b451e79a4fd0cf1af11a55f6c4b485aea57187864790b6e50db529
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=1
INFO: stdout=
INFO: stderr=2022-10-10 21:39:03 UTC [1177659]: db=,user=,app=,client= FATAL: data directory "/var/lib/postgresql/data" has invalid permissions
[1177659]: db=,user=,app=,client= DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Cluster bootstrap completed
INFO [io.st.co.po.PostgresBootstrapReconciliator] (Cluster-ReconciliationCycle) Setting cluster arch x86_64 and os linux
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconcile postgres extensions...
INFO [io.st.co.ex.ExtensionReconciliator] (Cluster-ReconciliationCycle) Reconciliation of postgres extensions completed
ERROR [io.st.cl.co.PgBouncerReconciliator] (Cluster-ReconciliationCycle) An error occurred while updating pgbouncer auth_file: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
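A side note for whoever picks this up: the loop above also keeps hitting the FATAL about /var/lib/postgresql/data having invalid permissions, so the single-user crash recovery may be failing on permissions rather than on the API outage itself. A quick diagnostic from inside the pod (assuming the database container is named patroni; adjust -c to the actual container name):

$ kubectl -n development exec app-pgsql-2 -c patroni -- ls -ld /var/lib/postgresql/data
$ # PostgreSQL requires u=rwx (0700) or u=rwx,g=rx (0750) on the data directory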
In the meantime, the other pods have also reported similar problems with the API.
app-pgsql-0 (replica)
ERROR [io.st.cl.co.ClusterControllerReconciliationCycle] (Cluster-ReconciliationCycle) 83535| Cluster reconciliation cycle failed reconciling development.app-pgsql: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.240.16.1/api/v1/namespaces/development/configmaps?labelSelector=stackgres.io%2Fcluster-name%3Dapp-pgsql%2Capp%3DStackGresCluster. Message: rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58512->10.244.109.41:2379: read: connection timed out. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.111.77:58512->10.244.109.41:2379: read: connection timed out, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
app-pgsql-1 (replica)
ERROR [io.st.cl.co.PgBouncerReconciliator] (Cluster-ReconciliationCycle) An error occurred while updating pgbouncer auth_file: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [SGPoolingConfig] with name: [generated-from-default-1659596625556] in namespace: [development] failed.
...
Caused by: java.net.SocketTimeoutException: timeout
Thank you and best regards,
Ricardo