first version of cluster-maxunavailable controller
This MR is a first shot at implementing a controller that interacts with the Cluster API controllers to ensure a "maxUnavailable 1" behavior for all Machines of a Cluster, i.e. ensuring that at any moment in time, no more than one Machine is being rebuilt.
This is meant to address sylva-projects/sylva-core#2371 and sylva-projects/sylva-core#2484 (closed).
The base concept (imagined together with @feleouet) is:
- all Machines are preliminarily annotated with a Cluster API pre-drain hook: this ensures that until this hook is removed, a Machine will not be touched by the Cluster API controllers (draining is the first thing those controllers do when deleting a Machine) -- see the sketch after this list
- how the Cluster API controllers (control plane controller, MachineDeployment controllers) decide which Machine is to be deleted is unchanged
- among the Machines marked for deletion based on actions by the Cluster API controllers, our cluster-max-unavailable controller will then pick one Machine at a time and remove its pre-drain hook, to let Cluster API actually drain it and rebuild it
- to ensure that we only let one Machine at a time be drained and rebuilt, the controller always checks that the CP and MDs are all back to their target state before picking a new Machine and letting CAPI drain it
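To make the first point concrete, here is a minimal sketch (not the code of this MR) of how the Machines of a Cluster can be held with a pre-drain hook annotation. The pre-drain.delete.hook.machine.cluster.x-k8s.io/ prefix is the Cluster API convention for pre-drain machine deletion hooks; the key suffix and owner value below are illustrative.

```go
package maxunavailable

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Illustrative hook key: the prefix is the CAPI pre-drain hook convention,
// the suffix identifies this controller.
const preDrainHookAnnotation = "pre-drain.delete.hook.machine.cluster.x-k8s.io/cluster-maxunavailable"

// addPreDrainHooks annotates all Machines belonging to the given Cluster so
// that they stay un-drained until the controller explicitly releases them.
func addPreDrainHooks(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name},
	); err != nil {
		return err
	}
	for i := range machines.Items {
		m := &machines.Items[i]
		if _, held := m.Annotations[preDrainHookAnnotation]; held {
			continue // already held by our hook
		}
		patch := client.MergeFrom(m.DeepCopy())
		if m.Annotations == nil {
			m.Annotations = map[string]string{}
		}
		m.Annotations[preDrainHookAnnotation] = "cluster-maxunavailable-controller"
		if err := c.Patch(ctx, m, patch); err != nil {
			return err
		}
	}
	return nil
}
```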
To properly cover sylva-projects/sylva-core#2371, the controller ensures that it will not pick an MD Machine for deletion if its MD could not scale up because the MachineSet Pre-Flight Checks fail. Allowing that would create a deadlock: we would remove an MD Machine and wait for its replacement to be back up before moving forward with the cluster node rolling update, but the replacement would fail the pre-flight checks and not come up until the CP is upgraded and stable, which would never happen because our controller would not let a CP Machine be drained in the meantime. Note that if https://github.com/kubernetes-sigs/cluster-api/issues/12187 is addressed upstream, this guard will no longer be needed.
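A hedged sketch of such a guard follows; the condition type and reason used to detect failing pre-flight checks are assumptions about how Cluster API reports them, not code taken from this MR.

```go
package maxunavailable

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineSetBlockedByPreflight reports whether the MachineSet owning the given
// Machine is currently unable to create replacement Machines because a
// pre-flight check fails; such a Machine must not be released for drain,
// otherwise the node rolling update would deadlock as described above.
func machineSetBlockedByPreflight(ctx context.Context, c client.Client, machine *clusterv1.Machine) (bool, error) {
	owner := metav1.GetControllerOf(machine)
	if owner == nil || owner.Kind != "MachineSet" {
		return false, nil // not a MachineDeployment/MachineSet Machine, nothing to check
	}
	ms := &clusterv1.MachineSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: machine.Namespace, Name: owner.Name}, ms); err != nil {
		return false, err
	}
	for _, cond := range ms.Status.Conditions {
		// Assumption: CAPI surfaces failing pre-flight checks as a False
		// condition with reason "PreflightCheckFailed"; adjust the condition
		// type/reason to the CAPI version actually in use.
		if cond.Status == corev1.ConditionFalse && cond.Reason == "PreflightCheckFailed" {
			return true, nil
		}
	}
	return false, nil
}
```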
The controller is called cluster-maxunavailable with the idea that it would eventually allow choosing a maximum number of unavailable Machines in the cluster -- but the current implementation only implements "at most one Machine is unavailable"; the number is not actually configurable.
The controller only applies this behavior to Clusters annotated with cluster-maxunavailable.sylva.org/enabled.
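A minimal sketch of how such an opt-in can be enforced with a controller-runtime predicate (the exact wiring in this MR may differ):

```go
package maxunavailable

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

const enabledAnnotation = "cluster-maxunavailable.sylva.org/enabled"

// enabledClusters only lets events through for Cluster objects that carry the
// opt-in annotation; all other objects and un-annotated Clusters are ignored.
var enabledClusters = predicate.NewPredicateFuncs(func(obj client.Object) bool {
	if _, isCluster := obj.(*clusterv1.Cluster); !isCluster {
		return false
	}
	_, enabled := obj.GetAnnotations()[enabledAnnotation]
	return enabled
})
```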
If the concept works for us, this idea could become a proposal to share with the Cluster API community.
Integration in sylva-core
sylva-core integration MR: sylva-projects/sylva-core!4853 (merged)
This controller should work easily with previous versions of Sylva.
Test / demo
This implementation can be tested with:
$ make
$ bin/manager
(assuming the KUBECONFIG is set and points to the management cluster)
With this, triggering a rolling upgrade of the nodes of a Cluster configured with maxSurge zero (e.g. a CAPO cluster) will illustrate that the rebuild of two nodes never happens in parallel.
Events are generated to make it easier to see why Machines aren't draining.
$ k describe Machine/management-cluster-md-ubuntu-wz5lh-kq62g | tail
Observed Generation: 4
Reason: WaitingForPreDrainHook
Status: True
Type: Deleting
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulSetNodeRef 66m (x2 over 66m) machine-controller management-cluster-md-ubuntu-wz5lh-kq62g
Normal not-selected 8s (x2 over 8s) cluster-maxunavailable Machine not selected for being drained: Some pre-flight checks failed on machineset: [MachineSet version (1.32.8+rke2r1) and ControlPlane version (1.31.8+rke2r1) do not conform to the kubernetes version skew policy as MachineSet version is higher than ControlPlane version ("KubernetesVersionSkew" preflight check failed)]
$ k describe cluster management-cluster | tail
Normal last-action-too-recent 22m (x7 over 22m) cluster-maxunavailable Would have pre-drain hook on Machine management-cluster-md0-k5nqg-vn6hg, but last action too recent
Normal removing-pre-drain-hook 18m cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md0-k5nqg-vn6hg
Normal pre-drain-hook-removed 18m cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md0-k5nqg-vn6hg
Normal noop-machine-deletion-in-progress 18m (x5 over 18m) cluster-maxunavailable There is a deleting machine free from our pre-drain hook, no need to remove hook on any machine
Normal noop-machine-deletion-in-progress 16m (x13 over 18m) cluster-maxunavailable There is a deleting machine free from our pre-drain hook, no need to remove hook on any machine
Normal no-machine-can-be-removed 4m10s (x11 over 8m30s) cluster-maxunavailable No machine selected for drain among deleting machines held by our pre-drain hook (ongoing node rolling update, or failing pre-flight-check)
Normal removing-pre-drain-hook 4m10s cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md-ubuntu-wz5lh-kq62g
Normal pre-drain-hook-removed 4m10s cluster-maxunavailable Removing pre-drain hook on Machine management-cluster-md-ubuntu-wz5lh-kq62g
Normal last-action-too-recent 4m10s cluster-maxunavailable Would have pre-drain hook on Machine management-cluster-md-ubuntu-wz5lh-kq62g, but last action too recent
Normal noop-machine-deletion-in-progress 3m19s (x12 over 4m10s) cluster-maxunavailable There is a deleting machine free from our pre-drain hook, no need to remove hook on any machine
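The events above can be emitted with the standard Kubernetes EventRecorder; a minimal sketch (the helper name is illustrative, the event reason is the one from the output above):

```go
package maxunavailable

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// explainNotSelected records on the Machine why it was not picked for
// draining, so that "kubectl describe machine" surfaces the reason as an event.
func explainNotSelected(recorder record.EventRecorder, machine *clusterv1.Machine, reason string) {
	recorder.Eventf(machine, corev1.EventTypeNormal, "not-selected",
		"Machine not selected for being drained: %s", reason)
}
```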
How to avoid two subsequent Machine drains in a short time
Without further precautions, the following race condition would exist:
- the cluster-maxunavailable controller observes that the CP/MDs are in a state that allows draining a Machine
- it picks a Machine for draining
- if it reconciles again immediately, the CP/MDs will possibly still be in the same state (because their controllers would not have reacted yet and updated their status), still allowing a Machine to be drained, and the cluster-maxunavailable controller would pick another Machine at once
To avoid this, my implementation does the following:
- when removing the pre-drain hook from a Machine, a cluster-maxunavailable.sylva.org/last-pre-drain-hook-delete-action annotation is put on the Cluster object
- before removing the pre-drain hook from a Machine, a check is made that the timestamp in the cluster-maxunavailable.sylva.org/last-pre-drain-hook-delete-action annotation is at least 3 minutes in the past (see the sketch below)
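A minimal sketch of this check, assuming the annotation stores an RFC3339 timestamp (the storage format is an assumption):

```go
package maxunavailable

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

const (
	lastActionAnnotation        = "cluster-maxunavailable.sylva.org/last-pre-drain-hook-delete-action"
	minDelayBetweenHookRemovals = 3 * time.Minute
)

// lastActionTooRecent reports whether a pre-drain hook was removed from a
// Machine of this Cluster less than 3 minutes ago, in which case no new
// Machine may be released for drain yet.
func lastActionTooRecent(cluster *clusterv1.Cluster, now time.Time) bool {
	raw, ok := cluster.Annotations[lastActionAnnotation]
	if !ok {
		return false // no hook removed so far, nothing to wait for
	}
	last, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false // unreadable timestamp: don't block forever on it
	}
	return now.Sub(last) < minDelayBetweenHookRemovals
}
```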
This would of course not work reliably if our controller were allowed to process multiple reconciliation requests in parallel, so the controller is configured with MaxConcurrentReconciles: 1.
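For illustration, this is roughly how such a single-worker registration looks with controller-runtime's builder (the reconciler type name is illustrative):

```go
package maxunavailable

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// ClusterMaxUnavailableReconciler is an illustrative reconciler type name.
type ClusterMaxUnavailableReconciler struct {
	client.Client
}

// Reconcile is a stub here; the point of the sketch is the registration below.
func (r *ClusterMaxUnavailableReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller with at most one reconcile in
// flight, which is what makes the "last action" timestamp check race-free.
func (r *ClusterMaxUnavailableReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Cluster{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(r)
}
```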
Future
- we could possibly toughen the logic to take into account cases where Machines are ready, but where the Node is not healthy
- the "cluster maxunavailable 1" behavior could be automatically applied based on whether maxSurge 0 is set or not (unconditionally remove hook on any deleting Machine from an MD/CP not having maxSurge 0)
- I (@feleouet) had proposed a solution in https://gitlab.com/sylva-projects/sylva-elements/misc-controllers-suite/-/merge_requests/1/diffs?commit_id=6477b0452c69b417f476b3a7c772dda5f7e4538c, but it was reverted for now; I'll re-introduce it later.
This MR should be merged after !2, which initiates the kubebuilder project.