Patroni cluster keeps failing over due to Endpoints annotations being removed
Summary
Given a 2-node cluster, the nodes keep failing over; attempting a manual failover also fails.
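For reference, the manual failover was attempted along these lines (the exact invocation is an assumption, following the same patronictl pattern as the commands below):

kubectl exec -it my-db-cluster-0 -c patroni -- patronictl failover my-db-cluster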
Current Behaviour
Once the cluster is ready, the Patroni state (from patronictl list) is wrong:
+ Cluster: my-db-cluster (6979305705338785925) -+-----------+
|      Member     | Host |   Role  | State | TL | Lag in MB |
+-----------------+------+---------+-------+----+-----------+
| my-db-cluster-0 |      | Replica |       |    |  unknown  |
| my-db-cluster-1 |      |  Leader |       |    |           |
| my-db-cluster-2 |      | Replica |       |    |  unknown  |
+-----------------+------+---------+-------+----+-----------+
The Postgres timeline also keeps increasing:
➜ kubectl exec -it my-db-cluster-0 -c patroni -- patronictl topology
+ Cluster: my-db-cluster (uninitialized) -+---------+---------+----+-----------+
|       Member      |         Host        |   Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+---------+---------+----+-----------+
| my-db-cluster-0   |                     |  Leader |         |    |           |
| + my-db-cluster-1 | 192.168.18.65:7433  | Replica | running | 12 |         0 |
| + my-db-cluster-2 | 192.168.43.218:7433 | Replica | running | 12 |         0 |
+-------------------+---------------------+---------+---------+----+-----------+
and after a few minutes (or less):
➜ kubectl exec -it my-db-cluster-0 -c patroni -- patronictl topology
+ Cluster: my-db-cluster (uninitialized) -+---------+---------+----+-----------+
|       Member      |         Host        |   Role  |  State  | TL | Lag in MB |
+-------------------+---------------------+---------+---------+----+-----------+
| my-db-cluster-2   |                     |  Leader |         |    |           |
| + my-db-cluster-0 | 192.168.88.123:7433 | Replica | running | 15 |         0 |
| + my-db-cluster-1 | 192.168.18.65:7433  | Replica | running | 15 |         0 |
+-------------------+---------------------+---------+---------+----+-----------+
There is no relevant error in the logs, but these events show up frequently:
default 0s Warning Unhealthy pod/my-db-cluster-2 Readiness probe failed: HTTP probe failed with statuscode: 503
default 0s Normal DistributedLogsUpdated sgdistributedlogs/my-distributed-logs StackGres Centralized Logging default.my-distributed-logs updated
default 0s Normal ClusterUpdated sgcluster/my-db-cluster StackGres Cluster default.my-db-cluster updated
default 0s Normal DistributedLogsUpdated sgdistributedlogs/my-distributed-logs StackGres Centralized Logging default.my-distributed-logs updated
default 0s Normal ClusterUpdated sgcluster/my-db-cluster StackGres Cluster default.my-db-cluster updated
(the DistributedLogsUpdated/ClusterUpdated pair keeps repeating)
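The events above were captured with something like the following (the exact flags are an assumption; the namespace column suggests an all-namespaces watch):

kubectl get events --all-namespaces --watch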
Patroni keeps failing over, leaving the following messages in the log:
2021-06-30 01:19:13,578 INFO: Could not take out TTL lock
2021-06-30 01:19:13,712 INFO: demoted self after trying and failing to obtain lock
2021-06-30 01:19:13,713 INFO: Lock owner: my-db-cluster-2; I am my-db-cluster-0
2021-06-30 01:19:13,713 INFO: Lock owner: my-db-cluster-2; I am my-db-cluster-0
2021-06-30 01:19:13,713 INFO: starting after demotion in progress
2021-06-30 01:19:13,714 INFO: closed patroni connection to the postgresql cluster
The StackGres operator is patching the Endpoints resource that Patroni uses for leader election, removing its annotations so that the election state is lost. Below is the Endpoints object before the operator's patch (with Patroni's election annotations), followed by the same object after the patch (annotations gone):
---
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    acquireTime: "2021-06-30T14:55:11.668285+00:00"
    leader: my-db-cluster-2
    optime: "1208698360"
    renewTime: "2021-06-30T14:55:11.722641+00:00"
    transitions: "0"
    ttl: "30"
  creationTimestamp: "2021-06-30T14:18:41Z"
  labels:
    app: StackGresCluster
    cluster: "true"
    cluster-name: my-db-cluster
    cluster-uid: c61a5a37-dee1-4584-a0d4-b5907dddb691
  name: my-db-cluster
  namespace: default
  ownerReferences:
  - apiVersion: stackgres.io/v1
    controller: true
    kind: SGCluster
    name: my-db-cluster
    uid: c61a5a37-dee1-4584-a0d4-b5907dddb691
  resourceVersion: "21459"
  selfLink: /api/v1/namespaces/default/endpoints/my-db-cluster
  uid: 9f5c1372-6003-4e0d-8179-9d0974519481
subsets:
- addresses:
  - hostname: my-db-cluster-2
    ip: 192.168.6.62
    nodeName: ip-192-168-28-183.us-east-2.compute.internal
    targetRef:
      kind: Pod
      name: my-db-cluster-2
      namespace: default
      resourceVersion: "21414"
      uid: 300609ef-9474-4434-a90a-48e921780936
  ports:
  - name: pgport
    port: 7432
    protocol: TCP
  - name: pgreplication
    port: 7433
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2021-06-30T14:18:41Z"
  labels:
    app: StackGresCluster
    cluster: "true"
    cluster-name: my-db-cluster
    cluster-uid: c61a5a37-dee1-4584-a0d4-b5907dddb691
  name: my-db-cluster
  namespace: default
  ownerReferences:
  - apiVersion: stackgres.io/v1
    controller: true
    kind: SGCluster
    name: my-db-cluster
    uid: c61a5a37-dee1-4584-a0d4-b5907dddb691
  resourceVersion: "21517"
  selfLink: /api/v1/namespaces/default/endpoints/my-db-cluster
  uid: 9f5c1372-6003-4e0d-8179-9d0974519481
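The patching can be observed live while the operator reconciles. The following command (one way to watch it, not taken from the report) shows the acquireTime, leader and renewTime annotations disappearing every time the operator applies its update:

kubectl get endpoints my-db-cluster -o yaml --watch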
Steps to reproduce
- Create a Kubernetes cluster with version 1.19
- Create a StackGres cluster with 2 instances
Expected Behaviour
Patroni nodes do not keep failing over, and manual failover works.
Possible Solution
Annotations should not be overwritten but merged (see the sketch after this list):
- Required annotations will overwrite existing annotations with the same key or, if no existing annotation matches the key, will be added.
- Existing annotations will not be removed when no required annotation matches their key.
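A minimal sketch of these semantics against the API server, assuming the operator can express its update as a JSON merge patch (the annotation key and value below are made up for illustration):

kubectl patch endpoints my-db-cluster --type merge \
  -p '{"metadata":{"annotations":{"required-annotation":"required-value"}}}'

A merge patch adds or overwrites only the keys it names and leaves every other annotation, including Patroni's leader, acquireTime and renewTime, untouched; a full update that sends a complete annotations map wipes them.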
Environment
- StackGres version:
❯ kubectl get deployments -n stackgres stackgres-operator --template '{{ printf "%s\n" (index .spec.template.spec.containers 0).image }}'
stackgres/operator:1.0.0-beta1
- Kubernetes version:
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"archive", BuildDate:"2021-05-14T14:09:09Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.8-eks-96780e", GitCommit:"96780e1b30acbf0a52c38b6030d7853e575bcdf3", GitTreeState:"clean", BuildDate:"2021-03-10T21:32:29Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.19) exceeds the supported minor version skew of +/-1
- Cloud provider or hardware configuration:
created with eksctl:
--node-type m5a.2xlarge --node-volume-size 100 --nodes 3
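Assembled into a full command this is roughly the following (cluster name and region flags omitted; --version taken from the steps to reproduce):

eksctl create cluster --version 1.19 --node-type m5a.2xlarge --node-volume-size 100 --nodes 3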