cinder-backup graceful restart
Summary
Currently, the cinder-backup statefulset restarts its pods regardless of whether a backup is running on them. Running backups are therefore aborted, and the OpenStack users have to restart them.
Use Cases
I want to be able to roll out changes (config changes or software updates) to cinder and cinder-backup without any impact on running backups. Backups may be created from large volumes, so waiting for a backup to finish can take multiple hours.
Proposal
- add a stop lifecycle hook to the cinder-backup pods that
  - disables the agent in cinder
  - waits until all running backups are finished
- add a start script that enables the agent in cinder after startup
- specify a sufficiently large stop timeout for the pod
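
To make the three points above concrete, here is a minimal sketch of the pod template fields they would touch, written as a Python dict that could be merged into the statefulset. `terminationGracePeriodSeconds` and `lifecycle.preStop.exec.command` are standard Kubernetes fields; the function name, the script paths, and the container name are placeholders assumed for illustration.

```python
import json


def backup_pod_overrides(stop_timeout_seconds=3600,
                         stop_script="/usr/local/bin/cinder-backup-stop",
                         start_script="/usr/local/bin/cinder-backup-start"):
    """Return the pod template fields touched by the proposal.

    The script paths and the container name are hypothetical placeholders.
    """
    if stop_timeout_seconds <= 0:
        raise ValueError("stop timeout must be a positive number of seconds")
    return {
        "spec": {
            # Kubernetes allows this long for the preStop hook plus process
            # shutdown before it kills the container.
            "terminationGracePeriodSeconds": stop_timeout_seconds,
            "containers": [{
                "name": "cinder-backup",
                # The start script wraps cinder-backup and enables the agent.
                "command": [start_script],
                # The stop hook disables the agent and waits for backups.
                "lifecycle": {
                    "preStop": {"exec": {"command": [stop_script]}},
                },
            }],
        }
    }


if __name__ == "__main__":
    # Example: allow up to four hours for running backups to finish.
    print(json.dumps(backup_pod_overrides(4 * 3600), indent=2))
```

Because the grace period covers the preStop hook as well as the shutdown of the process itself, the hook's own wait would probably need to stay somewhat below this value.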
Alternative proposal
- set the updateStrategy of the statefulset to OnDelete
- build functionality in the operator to do the following on an update
  - update the statefulset
  - disable one of the backup agents of the statefulset
  - wait for it to finish all running backups
  - delete the pod and wait for recreation
  - re-enable the backup agent
  - repeat until all pods are done
We decided against this option
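
For comparison, the per-pod sequence that the operator would have had to implement can be sketched as follows. This is only an illustration of the rejected flow: the four callables are hypothetical stand-ins for operator logic (Cinder API calls and Kubernetes pod deletion) and do not correspond to any existing operator code.

```python
from typing import Callable, Iterable, List


def rolling_restart(pods: Iterable[str],
                    disable: Callable[[str], None],
                    wait_idle: Callable[[str], None],
                    delete_and_wait: Callable[[str], None],
                    enable: Callable[[str], None]) -> List[str]:
    """Restart backup pods one at a time without aborting backups.

    All callables are hypothetical hooks supplied by the caller.
    """
    done = []
    for pod in pods:
        disable(pod)          # stop scheduling new backups on this agent
        wait_idle(pod)        # block until its running backups finished
        delete_and_wait(pod)  # with OnDelete, deleting triggers the update
        enable(pod)           # put the recreated agent back into service
        done.append(pod)
    return done


if __name__ == "__main__":
    log = []
    rolling_restart(
        ["cinder-backup-0", "cinder-backup-1"],
        disable=lambda p: log.append(("disable", p)),
        wait_idle=lambda p: log.append(("wait", p)),
        delete_and_wait=lambda p: log.append(("delete", p)),
        enable=lambda p: log.append(("enable", p)),
    )
    for step in log:
        print(step)
```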
Specification
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this section are to be interpreted as described in RFC 2119.
- MUST add a start script to the cinder image
  - the script MUST start cinder-backup and wait for it to come up in the volume service list
  - the script MUST then enable cinder-backup using the cinder API (by using the credentials cinder-backup has anyway)
- MUST add a stop script to the cinder image
  - the script MUST disable cinder-backup using the cinder API (by using the credentials cinder-backup has anyway)
  - the script MUST wait until the current cinder-backup process is no longer processing any backup
  - the script MUST then stop cinder-backup and itself
- MUST add a stop lifecycleHook to the cinder-backup statefulset to call the stop script
- MUST call the start script instead of cinder-backup directly (in the statefulset)
- MUST set a stop timeout for the cinder-backup pod that is sufficiently large (at least 1h)
  - SHOULD make the timeout configurable
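
The ordering required of the start and stop scripts can be sketched as below. This is a minimal sketch under stated assumptions, not the actual scripts: every callable passed in (`start_process`, `is_registered`, `enable_service`, `disable_service`, `running_backups`, `stop_process`) is a hypothetical hook that a real script would implement with Cinder API calls and process handling, and the default timeouts are illustrative.

```python
import time
from typing import Any, Callable


def wait_until(condition: Callable[[], bool], timeout: float,
               interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll condition() until it is true or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def start_hook(start_process: Callable[[], Any],
               is_registered: Callable[[], bool],
               enable_service: Callable[[], None],
               timeout: float = 300.0, interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> Any:
    """Start cinder-backup, wait for it in the service list, then enable it."""
    proc = start_process()
    if not wait_until(is_registered, timeout, interval, sleep):
        raise TimeoutError("cinder-backup did not appear in the service list")
    enable_service()
    return proc


def stop_hook(disable_service: Callable[[], None],
              running_backups: Callable[[], int],
              stop_process: Callable[[], None],
              timeout: float = 3600.0, interval: float = 5.0,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Disable the agent, wait for running backups, then stop the process.

    Returns True if all backups finished before the timeout, else False.
    """
    disable_service()  # no new backups get scheduled to this agent
    drained = wait_until(lambda: running_backups() == 0,
                         timeout, interval, sleep)
    stop_process()     # stop even on timeout; the pod is killed anyway
    return drained


if __name__ == "__main__":
    remaining = iter([2, 1, 0])  # simulated count of in-progress backups
    ok = stop_hook(lambda: print("disabled"),
                   lambda: next(remaining),
                   lambda: print("stopped"),
                   timeout=60, interval=0)
    print("drained:", ok)
```

Disabling the service before waiting matters: otherwise the scheduler could keep assigning new backups to the agent and the wait might never finish. The timeout passed to `stop_hook` would presumably be derived from the configurable pod stop timeout.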
Edited by Felix Huettner