All postgres operations freeze when an entire node goes down.

Summary

In a simple 2-instance setup, all postgres queries freeze indefinitely when a k8s node goes down. It doesn't matter whether the leader or the replica runs on that node.

Current Behaviour

Steps to reproduce

I've set up a k8s cluster with 2 worker nodes and created a very simple StackGres cluster, as described in the documentation.

---
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  namespace: default
  name: voo-development-db
spec:
  postgres:
    version: '14.1'
    extensions: 
      - name: uuid-ossp
  instances: 2
  pods:
    persistentVolume:
      size: '10Gi'
      storageClass: 'stackgres'
  initialData:
    scripts:
    - name: create-voo-database
      script: |
        create database voo owner postgres;
  prometheusAutobind: true

---
apiVersion: v1
kind: Service
metadata:
  name: voo-development-db-nodeport
spec:
  ports:
    - port: 7432
      nodePort: 30432
      name: postgres
      protocol: TCP
  selector:
    # as always: selector labels must match the labels from the pod template in the deployment
    app: StackGresCluster
    cluster-name: voo-development-db
    role: master
  # This declares the type of Service as nodePort.
  type: NodePort
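
The Service above routes to whichever pod currently carries the role: master label that Patroni maintains. For what it's worth, a minimal sketch of how I check which pod that is, using the official kubernetes Python client (a local kubeconfig with cluster access is assumed; the labels come from the manifests above):

# Minimal sketch: list the pod(s) currently matching the NodePort Service selector.
# Assumes a local kubeconfig; label values are taken from the Service manifest above.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="app=StackGresCluster,cluster-name=voo-development-db,role=master",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase, pod.spec.node_name)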

When I shut down one of the worker nodes, StackGres and Patroni still think everything is fine, but no queries can be executed against the database, neither when I connect via the NodePort above nor via postgres-util. Nothing works anymore.
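
To see what Patroni itself believes about the cluster, I query its REST API (default port 8008). A minimal sketch, assuming the pod name in the port-forward is a placeholder and that your Patroni version exposes the /cluster endpoint (Patroni 2.x does):

# Minimal sketch: ask Patroni's REST API what it thinks the cluster topology is.
# Assumes a prior port-forward, e.g.:
#   kubectl port-forward pod/<surviving-pod> 8008:8008
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8008/cluster", timeout=5) as resp:
    cluster = json.load(resp)

for member in cluster.get("members", []):
    print(member["name"], member["role"], member["state"])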

The queries don't fail, though; they hang forever until the second node is restarted, no matter how long that takes. They never run into a timeout.
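
The indefinite hang looks consistent with the client never giving up on a dead TCP peer, since libpq applies no connect timeout or keepalives by default. This doesn't address the missing failover, but a minimal client sketch that at least errors out instead of hanging (host and password are placeholders; the connection parameters are standard libpq ones passed through psycopg2):

# Minimal sketch: connect through the NodePort with explicit timeouts/keepalives so a
# dead backend produces an error instead of an indefinite hang.
# Host and password are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<any-worker-node-ip>",   # the NodePort is reachable on any node
    port=30432,
    dbname="voo",
    user="postgres",
    password="<password>",
    connect_timeout=10,            # seconds allowed to establish the connection
    keepalives=1,                  # enable TCP keepalives on the socket
    keepalives_idle=30,
    keepalives_interval=10,
    keepalives_count=3,
    options="-c statement_timeout=30000",  # 30 s per-statement cap, enforced server-side
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()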

Expected Behaviour

StackGres notices that a node went down (the etcd in k8s is still fully functional) and either elects a new leader, or keeps connections alive if it was the replica that went down.

Possible Solution

Environment

  • StackGres version: 1.1.0
  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"windows/amd64"}
    Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.7", GitCommit:"b56e432f2191419647a6a13b9f5867801850f969", GitTreeState:"clean", BuildDate:"2022-02-16T11:43:55Z", GoVersion:"go1.16.14", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: I'm running on 2 virtual worker nodes that are located in different data centers and connected through a private network via netmaker (WireGuard). There is a 3rd worker node which is tainted to serve only as an arbiter for MongoDB, so nothing else can be scheduled there. Each of the 2 main servers runs one etcd instance and a control plane; the 3rd etcd instance is hosted on AWS.

Relevant logs and/or screenshots

Here are some recent Patroni pod logs from the remaining worker node, which just keep repeating in a loop:

2022-04-21 15:59:43 UTC [76055]: db=[unknown],user=[unknown],app=[unknown],client=[local] LOG:  connection received: host=[local]
2022-04-21 15:59:43 UTC [76055]: db=voo,user=postgres,app=[unknown],client=[local] LOG:  connection authorized: user=postgres database=voo
2022-04-21 15:59:43 UTC [76055]: db=voo,user=postgres,app=[unknown],client=[local] LOG:  disconnection: session time: 0:00:00.043 user=postgres database=voo host=[local]