All Postgres operations freeze when an entire node goes down
Summary
In a simple 2-instance setup, all Postgres queries freeze indefinitely when a k8s node goes down. It doesn't matter whether the leader or the replica was running on that node.
Current Behaviour
Steps to reproduce
I've set up a k8s cluster with 2 worker nodes and created a very simple StackGres cluster as described in the documentation:
---
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  namespace: default
  name: voo-development-db
spec:
  postgres:
    version: '14.1'
    extensions:
      - name: uuid-ossp
  instances: 2
  pods:
    persistentVolume:
      size: '10Gi'
      storageClass: 'stackgres'
  initialData:
    scripts:
      - name: create-voo-database
        script: |
          create database voo owner postgres;
  prometheusAutobind: true
---
apiVersion: v1
kind: Service
metadata:
  name: voo-development-db-nodeport
spec:
  ports:
    - port: 7432
      nodePort: 30432
      name: postgres
      protocol: TCP
  selector:
    # as always: selector labels must match the labels from the pod template in the deployment
    app: StackGresCluster
    cluster-name: voo-development-db
    role: master
  # This declares the type of Service as NodePort.
  type: NodePort
When I shut down one of the worker nodes, StackGres and Patroni still report that everything is fine, but no queries can be executed against the database, neither via the NodePort above nor via postgres-util.
The queries don't fail, though. They hang until the 2nd node is restarted, no matter how long that takes, and they never hit a timeout.
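To narrow down where the hang occurs, a bounded TCP probe can help distinguish "the connection is never accepted" from "the connection opens but the query hangs". The following is only a minimal sketch using the Python standard library. The function name `probe_tcp` is hypothetical, the node IP is a placeholder, and 30432 is the NodePort from the manifest above. It does not speak the Postgres protocol, so an "open" result only says that something accepted the TCP connection.

```python
import socket


def probe_tcp(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Classify one TCP connect attempt as 'open', 'refused', 'timeout' or 'error'.

    Hypothetical helper for diagnosis only; it never waits longer than
    timeout_s, unlike the hanging queries described above.
    """
    try:
        # create_connection applies the timeout to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port (the peer answered with RST).
        return "refused"
    except socket.timeout:
        # No answer at all within timeout_s, e.g. packets are silently dropped.
        return "timeout"
    except OSError:
        # Any other network failure (unreachable host, DNS error, ...).
        return "error"


# Example call (placeholder address, replace with a real node IP):
# print(probe_tcp("<node-ip>", 30432))
```

If the probe reports "open" while queries still hang, the problem is probably behind the TCP accept (proxy, Postgres, or storage) rather than in Service routing. That interpretation is an assumption and has not been verified on this cluster.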
Expected Behaviour
StackGres should notice that a node went down, since etcd in k8s is still fully functional, and either elect a new leader (if the leader was lost) or keep existing connections alive (if only the replica was lost).
Possible Solution
Environment
- StackGres version: 1.1.0
- Kubernetes version:
  Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"windows/amd64"}
  Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.7", GitCommit:"b56e432f2191419647a6a13b9f5867801850f969", GitTreeState:"clean", BuildDate:"2022-02-16T11:43:55Z", GoVersion:"go1.16.14", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: I'm running on 2 virtual worker nodes which are located in different data centers and connected over a private network via netmaker (WireGuard). There is a 3rd worker node which is tainted to act only as an arbiter for MongoDB, so nothing else can be scheduled there. Each of the 2 main servers hosts one etcd instance and a control plane. The 3rd etcd instance is hosted on AWS.
Relevant logs and/or screenshots
Here are some recent Patroni logs from the remaining worker node. These lines repeat in a loop:
2022-04-21 15:59:43 UTC [76055]: db=[unknown],user=[unknown],app=[unknown],client=[local] LOG: connection received: host=[local]
2022-04-21 15:59:43 UTC [76055]: db=voo,user=postgres,app=[unknown],client=[local] LOG: connection authorized: user=postgres database=voo
2022-04-21 15:59:43 UTC [76055]: db=voo,user=postgres,app=[unknown],client=[local] LOG: disconnection: session time: 0:00:00.043 user=postgres database=voo host=[local]