securityUpgrade broken in Stackgres v1.6.0
Summary
A securityUpgrade on a Postgres cluster managed by Stackgres v1.6.0 does not behave as expected, although the same operation worked with Stackgres v1.5.0. With v1.6.0 there is always downtime, with both the "In Place" and the "Reduce Impact" option enabled. All Postgres pods are terminated at the same time and restart multiple times.
Current Behaviour
All pods are terminated at the same time, with both the "In Place" and the "Reduce Impact" option. After termination they start again but are then terminated a second time, which takes a really long time. Once the termination of the master has finished, it finally becomes ready again. Afterwards the same thing happens to the replica, and then once more to the master.
It looks like this (test-postgres-15-postgres-0 is the master as it is a brand new cluster):
NAME READY STATUS RESTARTS AGE
test-postgres-15-postgres-0 6/6 Running 0 11m
test-postgres-15-postgres-1 6/6 Running 0 3m51s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 0/1 ContainerCreating 0 1s
test-postgres-15-postgres-0 6/6 Running 0 11m
test-postgres-15-postgres-1 6/6 Running 0 3m54s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3s
test-postgres-15-postgres-0 6/6 Terminating 0 11m
test-postgres-15-postgres-1 6/6 Terminating 0 3m56s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 6s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 16s
test-postgres-15-postgres-0 0/6 Init:0/5 0 3s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 28s
test-postgres-15-postgres-0 0/6 PodInitializing 0 15s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 31s
test-postgres-15-postgres-0 5/6 Running 0 18s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 33s
test-postgres-15-postgres-0 6/6 Running 0 20s
test-postgres-15-postgres-1 0/6 Pending 0 1s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 35s
test-postgres-15-postgres-0 6/6 Running 0 22s
test-postgres-15-postgres-1 0/6 Terminating 0 3s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 2m36s
test-postgres-15-postgres-0 6/6 Running 0 2m23s
test-postgres-15-postgres-1 0/6 Terminating 0 2m4s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 2m39s
test-postgres-15-postgres-0 6/6 Running 0 2m26s
test-postgres-15-postgres-1 0/6 Init:0/5 0 2s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 2m51s
test-postgres-15-postgres-0 6/6 Running 0 2m38s
test-postgres-15-postgres-1 0/6 PodInitializing 0 14s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 2m58s
test-postgres-15-postgres-0 6/6 Running 0 2m45s
test-postgres-15-postgres-1 6/6 Running 0 21s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3m8s
test-postgres-15-postgres-0 6/6 Terminating 0 2m55s
test-postgres-15-postgres-1 6/6 Running 0 31s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3m13s
test-postgres-15-postgres-0 0/6 Pending 0 0s
test-postgres-15-postgres-1 6/6 Running 0 36s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3m16s
test-postgres-15-postgres-0 0/6 Init:0/5 0 3s
test-postgres-15-postgres-1 6/6 Running 0 39s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3m31s
test-postgres-15-postgres-0 0/6 PodInitializing 0 18s
test-postgres-15-postgres-1 6/6 Running 0 54s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3m33s
test-postgres-15-postgres-0 5/6 Running 0 20s
test-postgres-15-postgres-1 6/6 Running 0 56s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3m38s
test-postgres-15-postgres-0 6/6 Running 0 25s
test-postgres-15-postgres-1 6/6 Running 0 61s
So the whole operation takes almost 4 minutes and involves a lot of downtime.
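To put a rough number on the downtime, the snapshots above can be fed to a small script. The following is only a sketch under a few assumptions: the age of the `op...` pod is used as the clock, a pod counts as available only when it is `Running` with all containers ready (for example `6/6`), and a state is assumed to last until the next snapshot. The function names are made up for this illustration, and the sample text is an excerpt of the v1.6.0 listing, not the full run.

```python
import re

def parse_age(age: str) -> int:
    """Convert a kubectl AGE value such as '3m51s', '11m' or '61s' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", age))

def parse_snapshots(text: str):
    """Split concatenated `kubectl get pods` output into (clock, ready_pods) pairs.

    clock is the age of the op pod in seconds; snapshots without an op pod
    are dropped because they cannot be placed on the timeline.
    """
    snapshots = []
    for line in text.strip().splitlines():
        fields = line.split()
        if fields[0] == "NAME":              # header line starts a new snapshot
            snapshots.append({"clock": None, "ready": 0})
            continue
        name, ready, status, _restarts, age = fields
        if name.startswith("op"):            # the securityUpgrade job pod
            snapshots[-1]["clock"] = parse_age(age)
        else:                                # a Postgres pod
            have, want = ready.split("/")
            if status == "Running" and have == want:
                snapshots[-1]["ready"] += 1
    return [(s["clock"], s["ready"]) for s in snapshots if s["clock"] is not None]

def seconds_without_ready_pod(snapshots) -> int:
    """Sum the gaps that follow a snapshot in which no Postgres pod was ready."""
    total = 0
    for (t0, ready), (t1, _) in zip(snapshots, snapshots[1:]):
        if ready == 0:
            total += t1 - t0
    return total

# Excerpt of the v1.6.0 listing from this report.
SAMPLE = """
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 0/1 ContainerCreating 0 1s
test-postgres-15-postgres-0 6/6 Running 0 11m
test-postgres-15-postgres-1 6/6 Running 0 3m54s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 3s
test-postgres-15-postgres-0 6/6 Terminating 0 11m
test-postgres-15-postgres-1 6/6 Terminating 0 3m56s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 6s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 16s
test-postgres-15-postgres-0 0/6 Init:0/5 0 3s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 31s
test-postgres-15-postgres-0 5/6 Running 0 18s
NAME READY STATUS RESTARTS AGE
op2023-11-29-10-31-22-9bj6m 1/1 Running 0 33s
test-postgres-15-postgres-0 6/6 Running 0 20s
test-postgres-15-postgres-1 0/6 Pending 0 1s
"""

print(seconds_without_ready_pod(parse_snapshots(SAMPLE)))
```

For this excerpt the script reports 30 seconds during which no Postgres pod was fully ready. This is only an approximation from pod readiness: it says nothing about whether the ready pod was actually accepting writes as the master.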
Steps to reproduce
Deploy the Stackgres operator v1.6.0 and a simple sgCluster with 2 instances. Then create a securityUpgrade for that cluster via the web UI.
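For anyone who prefers to reproduce this without the web UI, the operation can probably also be created as an SGDbOps resource. The sketch below only builds such a manifest as JSON. The field names (`sgCluster`, `op`, `securityUpgrade.method`) and the method values `InPlace` and `ReducedImpact` reflect my understanding of the SGDbOps CRD and should be checked against the CRD reference of the installed version. The cluster name, namespace and resource name are placeholders, not values from this report.

```python
import json

# Method values as I understand the SGDbOps CRD; verify for your version.
ALLOWED_METHODS = ("InPlace", "ReducedImpact")

def security_upgrade_manifest(cluster: str, namespace: str, method: str) -> dict:
    """Build an SGDbOps manifest requesting a securityUpgrade for `cluster`."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {ALLOWED_METHODS}, got {method!r}")
    return {
        "apiVersion": "stackgres.io/v1",
        "kind": "SGDbOps",
        "metadata": {
            # Hypothetical naming scheme, chosen only for this example.
            "name": f"{cluster}-security-upgrade",
            "namespace": namespace,
        },
        "spec": {
            "sgCluster": cluster,
            "op": "securityUpgrade",
            "securityUpgrade": {"method": method},
        },
    }

# "my-cluster" and "default" are placeholders for the real cluster and namespace.
manifest = security_upgrade_manifest("my-cluster", "default", "InPlace")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped into `kubectl apply -f -` while `kubectl get pods -w` runs in a second terminal to capture the pod transitions.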
Expected Behaviour
I expected the same behaviour as with Stackgres v1.5.0, where the replica is terminated and upgraded, a clean failover from master to replica is performed, and afterwards the old master (now a replica) is upgraded. With the same setup, the same configuration and the same Postgres cluster (v14.6), it looks like this with Stackgres v1.5.0:
NAME READY STATUS RESTARTS AGE
test-postgres-15-postgres-0 6/6 Running 0 8m58s
test-postgres-15-postgres-1 6/6 Running 0 8m32s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 2s
test-postgres-15-postgres-0 6/6 Running 0 9m1s
test-postgres-15-postgres-1 6/6 Running 0 8m35s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 4s
test-postgres-15-postgres-0 6/6 Running 0 9m3s
test-postgres-15-postgres-1 6/6 Terminating 0 8m37s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 7s
test-postgres-15-postgres-0 6/6 Running 0 9m6s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 22s
test-postgres-15-postgres-0 6/6 Running 0 9m21s
test-postgres-15-postgres-1 0/6 Pending 0 1s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 24s
test-postgres-15-postgres-0 6/6 Running 0 9m23s
test-postgres-15-postgres-1 0/6 Init:0/5 0 3s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 44s
test-postgres-15-postgres-0 6/6 Running 0 9m43s
test-postgres-15-postgres-1 6/6 Running 0 23s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 71s
test-postgres-15-postgres-0 6/6 Terminating 0 10m
test-postgres-15-postgres-1 6/6 Running 0 50s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 76s
test-postgres-15-postgres-0 0/6 Init:0/5 0 2s
test-postgres-15-postgres-1 6/6 Running 0 55s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 1/1 Running 0 98s
test-postgres-15-postgres-0 6/6 Running 0 24s
test-postgres-15-postgres-1 6/6 Running 0 77s
NAME READY STATUS RESTARTS AGE
op2023-11-29-11-06-09-ctxv6 0/1 Completed 0 108s
test-postgres-15-postgres-0 6/6 Running 0 34s
test-postgres-15-postgres-1 6/6 Running 0 87s
Here it takes around 1.5 minutes without actual downtime.
Possible Solution
I don't know. It looks like the script inside the pod that performs the commands might be broken.
Environment
- StackGres version: v1.6.0
- Kubernetes version: v1.26.7
Relevant logs and/or screenshots
I uploaded the securityUpgrade logs for both Stackgres v1.5.0 and v1.6.0 as an attachment.