first version of cluster-maxunavailable controller
This MR is a first shot at implementing a controller that interacts with the Cluster API controllers to ensure a "maxUnavailable 1" behavior for all Machines of a Cluster, i.e. ensuring that at any moment in time, no more than one Machine is being rebuilt.
This is meant to address sylva-projects/sylva-core#2371 and sylva-projects/sylva-core#2484 (closed).
The base concept (imagined together with @feleouet) is:
- all Machines are preliminarily annotated with a Cluster API pre-drain hook: this ensures that until this hook is removed, a Machine will not be touched by the Cluster API controllers (draining is the first thing those controllers do when deleting a Machine) -- see the sketch after this list
- how the Cluster API controllers (control plane controller, MachineDeployment controllers) decide which Machine is to be deleted is unchanged
- among the Machines marked for deletion based on actions by the Cluster API controllers, our cluster-max-unavailable controller will then pick one Machine at a time and remove its pre-drain hook, to let Cluster API actually drain it and rebuild it
- to ensure that we only let one Machine at a time be drained and rebuilt, the controller always checks that the CP and MDs are all back to their target state before picking a new Machine and letting CAPI drain it
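To make the first point concrete, here is a minimal sketch (not the code of this MR) of how the Machines of a Cluster can be held with a pre-drain hook annotation. The pre-drain.delete.hook.machine.cluster.x-k8s.io/ prefix is the Cluster API convention for pre-drain machine deletion hooks; the key suffix and owner value below are illustrative.

```go
package maxunavailable

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Illustrative hook key: the prefix is the CAPI pre-drain hook convention,
// the suffix identifies this controller.
const preDrainHookAnnotation = "pre-drain.delete.hook.machine.cluster.x-k8s.io/cluster-maxunavailable"

// addPreDrainHooks annotates all Machines belonging to the given Cluster so
// that they stay un-drained until the controller explicitly releases them.
func addPreDrainHooks(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name},
	); err != nil {
		return err
	}
	for i := range machines.Items {
		m := &machines.Items[i]
		if _, held := m.Annotations[preDrainHookAnnotation]; held {
			continue // already held by our hook
		}
		patch := client.MergeFrom(m.DeepCopy())
		if m.Annotations == nil {
			m.Annotations = map[string]string{}
		}
		m.Annotations[preDrainHookAnnotation] = "cluster-maxunavailable-controller"
		if err := c.Patch(ctx, m, patch); err != nil {
			return err
		}
	}
	return nil
}
```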
To properly cover sylva-projects/sylva-core#2371, the controller ensures that it will not pick an MD Machine for deletion if its MD could not scale up because the MachineSet Pre-Flight Checks fail. Allowing that would create a deadlock: we would remove an MD Machine and wait for its replacement to be back up before moving forward with the cluster node rolling update, but the replacement would fail the pre-flight checks and not come up until the CP is upgraded and stable, which would never happen because our controller would not let a CP Machine be drained in the meantime. Note that if https://github.com/kubernetes-sigs/cluster-api/issues/12187 is addressed upstream, this guard will no longer be needed.
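A hedged sketch of such a guard follows; the condition type and reason used to detect failing pre-flight checks are assumptions about how Cluster API reports them, not code taken from this MR.

```go
package maxunavailable

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineSetBlockedByPreflight reports whether the MachineSet owning the given
// Machine is currently unable to create replacement Machines because a
// pre-flight check fails; such a Machine must not be released for drain,
// otherwise the node rolling update would deadlock as described above.
func machineSetBlockedByPreflight(ctx context.Context, c client.Client, machine *clusterv1.Machine) (bool, error) {
	owner := metav1.GetControllerOf(machine)
	if owner == nil || owner.Kind != "MachineSet" {
		return false, nil // not a MachineDeployment/MachineSet Machine, nothing to check
	}
	ms := &clusterv1.MachineSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: machine.Namespace, Name: owner.Name}, ms); err != nil {
		return false, err
	}
	for _, cond := range ms.Status.Conditions {
		// Assumption: CAPI surfaces failing pre-flight checks as a False
		// condition with reason "PreflightCheckFailed"; adjust the condition
		// type/reason to the CAPI version actually in use.
		if cond.Status == corev1.ConditionFalse && cond.Reason == "PreflightCheckFailed" {
			return true, nil
		}
	}
	return false, nil
}
```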
The controller is called cluster-maxunavailable with the idea that it would eventually allow choosing a maximum number of unavailable Machines in the cluster -- but the current implementation only implements "at most one Machine is unavailable"; the number is not actually configurable.
The controller only applies this behavior to Clusters annotated with cluster-maxunavailable.sylva.org/enabled.
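A minimal sketch of how such an opt-in can be enforced with a controller-runtime predicate (the exact wiring in this MR may differ):

```go
package maxunavailable

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

const enabledAnnotation = "cluster-maxunavailable.sylva.org/enabled"

// enabledClusters only lets events through for Cluster objects that carry the
// opt-in annotation; all other objects and un-annotated Clusters are ignored.
var enabledClusters = predicate.NewPredicateFuncs(func(obj client.Object) bool {
	if _, isCluster := obj.(*clusterv1.Cluster); !isCluster {
		return false
	}
	_, enabled := obj.GetAnnotations()[enabledAnnotation]
	return enabled
})
```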
If the concept works for us, this idea could become a proposal to share with the Cluster API community.
Integration in sylva-core
sylva-core integration MR: sylva-projects/sylva-core!4853 (merged)
This controller should work easily with previous versions of Sylva.
Test / demo
This implementation can be tested with:
$ make
$ bin/manager
(assuming the KUBECONFIG is set and points to the management cluster)
With this, triggering a rolling upgrade of the nodes of a Cluster configured with maxSurge zero (e.g. a CAPO cluster) will illustrate that the rebuild of two nodes never happens in parallel.
Events are generated to make it easier to see why Machines aren't draining.
$ k describe Machine/management-cluster-md-ubuntu-wz5lh-kq62g | tail
Observed Generation: 4
Reason: WaitingForPreDrainHook
Status: True
Type: Deleting
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulSetNodeRef 66m (x2 over 66m) machine-controller management-cluster-md-ubuntu-wz5lh-kq62g
Normal not-selected 8s (x2 over 8s) cluster-maxunavailable Machine not selected for being drained: Some pre-flight checks failed on machineset: [MachineSet version (1.32.8+rke2r1) and ControlPlane version (1.31.8+rke2r1) do not conform to the kubernetes version skew policy as MachineSet version is higher than ControlPlane version ("KubernetesVersionSkew" preflight check failed)]
$ k describe cluster management-cluster | tail
Normal last-action-too-recent 22m (x7 over 22m) cluster-maxunavailable Would have pre-drain hook on Machine management-cluster-md0-k5nqg-vn6hg, but last action too recent
Normal removing-pre-drain-hook 18m cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md0-k5nqg-vn6hg
Normal pre-drain-hook-removed 18m cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md0-k5nqg-vn6hg
Normal noop-machine-deletion-in-progress 18m (x5 over 18m) cluster-maxunavailable There is a deleting machine free from our pre-drain hook, no need to remove hook on any machine
Normal noop-machine-deletion-in-progress 16m (x13 over 18m) cluster-maxunavailable There is a deleting machine free from our pre-drain hook, no need to remove hook on any machine
Normal no-machine-can-be-removed 4m10s (x11 over 8m30s) cluster-maxunavailable No machine selected for drain among deleting machines held by our pre-drain hook (ongoing node rolling update, or failing pre-flight-check)
Normal removing-pre-drain-hook 4m10s cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md-ubuntu-wz5lh-kq62g
Normal pre-drain-hook-removed 4m10s cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md-ubuntu-wz5lh-kq62g
Normal last-action-too-recent 4m10s cluster-maxunavailable Would have pre-drain hook on Machine management-cluster-md-ubuntu-wz5lh-kq62g, but last action too recent
Normal noop-machine-deletion-in-progress 3m19s (x12 over 4m10s) cluster-maxunavailable There is a deleting machine free from our pre-drain hook, no need to remove hook on any machine
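The events above can be emitted with the standard Kubernetes EventRecorder; a minimal sketch (the helper name is illustrative, the event reason is the one from the output above):

```go
package maxunavailable

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// explainNotSelected records on the Machine why it was not picked for
// draining, so that "kubectl describe machine" surfaces the reason as an event.
func explainNotSelected(recorder record.EventRecorder, machine *clusterv1.Machine, reason string) {
	recorder.Eventf(machine, corev1.EventTypeNormal, "not-selected",
		"Machine not selected for being drained: %s", reason)
}
```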
How to avoid two subsequent Machine drains in a short time
Without further precautions, the following race condition would exist:
- the cluster-maxunavailable controller observes that the CP/MDs are in a state that allows draining a Machine
- it picks a Machine for draining
- if it reconciles again immediately, the CP/MDs will possibly still be in the same state (because their controllers would not have reacted yet and updated their status), still allowing a Machine to be drained, and the cluster-maxunavailable controller would pick another Machine at once
To avoid this, my implementation does the following:
- when removing the pre-drain hook from a Machine, a cluster-maxunavailable.sylva.org/last-pre-drain-hook-delete-action annotation is put on the Cluster object
- before removing the pre-drain hook from a Machine, a check is made that the timestamp in the cluster-maxunavailable.sylva.org/last-pre-drain-hook-delete-action annotation is at least 3 minutes in the past (see the sketch below)
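A minimal sketch of this check, assuming the annotation stores an RFC3339 timestamp (the storage format is an assumption):

```go
package maxunavailable

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

const (
	lastActionAnnotation        = "cluster-maxunavailable.sylva.org/last-pre-drain-hook-delete-action"
	minDelayBetweenHookRemovals = 3 * time.Minute
)

// lastActionTooRecent reports whether a pre-drain hook was removed from a
// Machine of this Cluster less than 3 minutes ago, in which case no new
// Machine may be released for drain yet.
func lastActionTooRecent(cluster *clusterv1.Cluster, now time.Time) bool {
	raw, ok := cluster.Annotations[lastActionAnnotation]
	if !ok {
		return false // no hook removed so far, nothing to wait for
	}
	last, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false // unreadable timestamp: don't block forever on it
	}
	return now.Sub(last) < minDelayBetweenHookRemovals
}
```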
This would of course not work reliably if our controller were allowed to process multiple reconciliation requests in parallel, so the controller is configured with MaxConcurrentReconciles: 1.
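For illustration, this is roughly how such a single-worker registration looks with controller-runtime's builder (the reconciler type name is illustrative):

```go
package maxunavailable

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// ClusterMaxUnavailableReconciler is an illustrative reconciler type name.
type ClusterMaxUnavailableReconciler struct {
	client.Client
}

// Reconcile is a stub here; the point of the sketch is the registration below.
func (r *ClusterMaxUnavailableReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller with at most one reconcile in
// flight, which is what makes the "last action" timestamp check race-free.
func (r *ClusterMaxUnavailableReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Cluster{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(r)
}
```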
Future
- we could possibly toughen the logic to take into account cases where Machines are ready, but where the Node is not healthy
- the "cluster maxunavailable 1" behavior could be automatically applied based on whether maxSurge 0 is set or not (unconditionally remove hook on any deleting Machine from an MD/CP not having maxSurge 0)
- I (@feleouet) had proposed a solution in https://gitlab.com/sylva-projects/sylva-elements/misc-controllers-suite/-/merge_requests/1/diffs?commit_id=6477b0452c69b417f476b3a7c772dda5f7e4538c, but it was reverted for now; I'll re-introduce it later.
This MR should be merged after !2, which initiates the kubebuilder project.