Backup fails sometimes after a restart is executed
Summary
After any type of restart DbOps operation is executed (restart, minorVersionUpgrade, majorVersionUpgrade) backups are completed anymore.
Steps to reproduce
- Create a SGCluster
- Add some data to the SGCluster (doesn't need to be much, only a few megabytes)
- Configure the backups to run at every 10 minutes
- Schedule DbOps restart, to run between backups so it doesn't execute at the same time.
- The backups that are run after the the restart would be stuck at running.
Helper script to reproduce the issue
Expected Behaviour
Backups should keep running without errors.
Environment
- StackGres version: 1.0.0
- Kubernetes version: 1.16
- Cloud provider or hardware configuration: eks
Relevant logs and/or screenshots
The backups that are stuck shows the following log
Lock acquired
I1014 10:25:05.461661 40 request.go:645] Throttling request took 1.184064812s, request: GET:https://10.100.0.1:443/apis/coordination.k8s.io/v1beta1?timeout=32s
I1014 10:25:09.529918 8383 request.go:645] Throttling request took 1.183353358s, request: GET:https://10.100.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
I1014 10:25:13.601303 16731 request.go:645] Throttling request took 1.183631134s, request: GET:https://10.100.0.1:443/apis/coordination.k8s.io/v1beta1?timeout=32s
Creating backup CR
apiVersion: stackgres.io/v1
kind: SGBackup
metadata:
annotations:
scheduled-backup: "true"
stackgres.io/operatorVersion: 1.0.0
creationTimestamp: "2021-10-14T10:25:20Z"
generation: 1
name: test-2021-10-14-10-25-08
namespace: backup-test
resourceVersion: "1146108"
selfLink: /apis/stackgres.io/v1/namespaces/backup-test/sgbackups/test-2021-10-14-10-25-08
uid: 35e3ffbe-4f6d-465b-83cf-e5760c7526a3
spec:
managedLifecycle: true
sgCluster: test
status:
process:
jobPod: test-backup-1634207100-pds2g
status: Running
sgBackupConfig:
baseBackups:
compression: lz4
storage:
s3:
awsCredentials:
secretKeySelectors:
accessKeyId:
key: accessKeyId
name: s3-backup-bucket-secret
secretAccessKey:
key: secretAccessKey
name: s3-backup-bucket-secret
bucket: xsierra-bucket
type: s3
I1014 10:25:21.946720 1168 request.go:645] Throttling request took 1.184166284s, request: GET:https://10.100.0.1:443/apis/coordination.k8s.io/v1?timeout=32s
Retrieving primary and replica
Primary is pod/test-0
Replica is pod/test-1
Performing backup
Backup completed
Extracting pg_controldata
Extraction of pg_controldata completed
Retain backups
Check if backup base_000000010000000000000006 has to be retained and will retain 6 backups
Mark base_000000010000000000000006 as permanent and will retain 5 more backups
INFO: 2021/10/14 10:25:32.597349 Retrieving previous related backups to be marked: toPermanent=true
INFO: 2021/10/14 10:25:33.103909 Retrieved backups to be marked, marking: [base_000000010000000000000006]
Check if backup base_000000010000000000000008 has to be retained and will retain 5 backups
Check if backup base_000000010000000000000004 has to be retained and will retain 5 backups
Check if backup base_000000040000000000000024 has to be retained and will retain 5 backups
Cleaning up impermanent backups
INFO: 2021/10/14 10:25:33.888634 retrieving permanent objects
INFO: 2021/10/14 10:25:34.351379 Found permanent objects: backups=map[base_000000010000000000000006:true], wals=map[000000010000000000000006:true]
INFO: 2021/10/14 10:25:34.960930 Start delete
INFO: 2021/10/14 10:25:36.243033 Objects in folder:
INFO: 2021/10/14 10:25:36.243060 will be deleted: basebackups_005/base_000000010000000000000004_backup_stop_sentinel.json
INFO: 2021/10/14 10:25:36.243072 will be deleted: basebackups_005/base_000000010000000000000008_backup_stop_sentinel.json
INFO: 2021/10/14 10:25:36.243082 will be deleted: wal_005/000000010000000000000001.lz4
INFO: 2021/10/14 10:25:36.243088 will be deleted: wal_005/000000010000000000000002.00000028.backup.lz4
INFO: 2021/10/14 10:25:36.243093 will be deleted: wal_005/000000010000000000000002.lz4
INFO: 2021/10/14 10:25:36.243106 will be deleted: wal_005/000000010000000000000003.lz4
INFO: 2021/10/14 10:25:36.243118 will be deleted: wal_005/000000010000000000000004.00000028.backup.lz4
INFO: 2021/10/14 10:25:36.243125 will be deleted: wal_005/000000010000000000000004.00000270.backup.lz4
INFO: 2021/10/14 10:25:36.243132 will be deleted: wal_005/000000010000000000000004.lz4
INFO: 2021/10/14 10:25:36.243136 will be deleted: wal_005/000000010000000000000005.lz4
INFO: 2021/10/14 10:25:36.243150 will be deleted: wal_005/000000010000000000000008.00000028.backup.lz4
INFO: 2021/10/14 10:25:36.243159 will be deleted: wal_005/000000010000000000000008.lz4
INFO: 2021/10/14 10:25:36.243166 will be deleted: wal_005/000000010000000000000009.lz4
INFO: 2021/10/14 10:25:36.243173 will be deleted: wal_005/00000001000000000000000A.lz4
INFO: 2021/10/14 10:25:36.243180 will be deleted: wal_005/00000001000000000000000B.partial.lz4
INFO: 2021/10/14 10:25:36.243190 will be deleted: wal_005/00000002000000000000000B.lz4
INFO: 2021/10/14 10:25:36.243211 will be deleted: wal_005/00000002000000000000000C.00000028.backup.lz4
INFO: 2021/10/14 10:25:36.243221 will be deleted: wal_005/00000002000000000000000C.lz4
INFO: 2021/10/14 10:25:36.243228 will be deleted: wal_005/00000002000000000000000D.lz4
INFO: 2021/10/14 10:25:36.243237 will be deleted: wal_005/00000002000000000000000E.00000028.backup.lz4
INFO: 2021/10/14 10:25:36.243248 will be deleted: wal_005/00000002000000000000000E.lz4
INFO: 2021/10/14 10:25:36.243257 will be deleted: wal_005/000000020000000000000010.00000028.backup.lz4
INFO: 2021/10/14 10:25:36.243333 will be deleted: basebackups_005/base_000000010000000000000004/metadata.json
INFO: 2021/10/14 10:25:36.243357 will be deleted: basebackups_005/base_000000010000000000000008/metadata.json
INFO: 2021/10/14 10:25:36.243375 will be deleted: basebackups_005/base_000000010000000000000004/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243384 will be deleted: basebackups_005/base_000000010000000000000004/tar_partitions/part_003.tar.lz4
INFO: 2021/10/14 10:25:36.243395 will be deleted: basebackups_005/base_000000010000000000000004/tar_partitions/pg_control.tar.lz4
INFO: 2021/10/14 10:25:36.243431 will be deleted: basebackups_005/base_000000010000000000000008/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243446 will be deleted: basebackups_005/base_000000010000000000000008/tar_partitions/part_003.tar.lz4
INFO: 2021/10/14 10:25:36.243452 will be deleted: basebackups_005/base_000000010000000000000008/tar_partitions/pg_control.tar.lz4
INFO: 2021/10/14 10:25:36.243459 will be deleted: basebackups_005/base_00000002000000000000000C/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243466 will be deleted: basebackups_005/base_00000002000000000000000C/tar_partitions/pg_control.tar.lz4
INFO: 2021/10/14 10:25:36.243472 will be deleted: basebackups_005/base_00000002000000000000000E/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243480 will be deleted: basebackups_005/base_00000002000000000000000E/tar_partitions/pg_control.tar.lz4
INFO: 2021/10/14 10:25:36.243488 will be deleted: basebackups_005/base_000000020000000000000010/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243495 will be deleted: basebackups_005/base_000000020000000000000010/tar_partitions/pg_control.tar.lz4
INFO: 2021/10/14 10:25:36.243501 will be deleted: basebackups_005/base_000000020000000000000012/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243510 will be deleted: basebackups_005/base_000000020000000000000012/tar_partitions/pg_control.tar.lz4
INFO: 2021/10/14 10:25:36.243523 will be deleted: basebackups_005/base_000000020000000000000014/tar_partitions/part_001.tar.lz4
INFO: 2021/10/14 10:25:36.243534 will be deleted: basebackups_005/base_000000020000000000000014/tar_partitions/pg_control.tar.lz4
Check if backup base_000000040000000000000024 has to be set permanent or impermanent
Check if backup base_000000010000000000000006 has to be set permanent or impermanent
Mark base_000000010000000000000006 as impermanent
INFO: 2021/10/14 10:25:38.373738 Retrieving previous related backups to be marked: toPermanent=false
INFO: 2021/10/14 10:25:38.652173 retrieving permanent objects
INFO: 2021/10/14 10:25:39.343033 Retrieved backups to be marked, marking: [base_000000010000000000000006]
Reconciliation of backups completed
Listing existing backups
I1014 10:25:41.801841 9217 request.go:645] Throttling request took 1.183187771s, request: GET:https://10.100.0.1:443/apis/coordination.k8s.io/v1?timeout=32s
Updating backup CR as completed
I1014 10:25:45.913577 17642 request.go:645] Throttling request took 1.184119377s, request: GET:https://10.100.0.1:443/apis/monitoring.coreos.com/v1alpha1?timeout=32s
sgbackup.stackgres.io/test-2021-10-14-10-25-08 patched
Backup CR updated as completed
Reconcile backup CRs
Reconciliation of backup CRs completed
Lock released
Edited by Xavier Sierra