refactor cluster-maxunavailable controller to not depend on CP/MD status to determine how many Machines are ok

Related to sylva-projects/sylva-core#3234

This MR refactors the cluster-maxunavailable controller so that it no longer depends on the status.(available|ready)Replicas fields of MachineDeployment and XXXControlPlane resources.

Depending on these statuses has proven to be an unreliable way of telling whether the number of CP/MD Machines is too low to allow draining/deleting a Machine, for several reasons:

  • (1) we observed that the different CP/MD controllers do not necessarily maintain status.(available|ready)Replicas fields that reliably reflect the fact that a Machine is being drained
  • (2) even if they did, some time may elapse between the removal of the cluster-maxunavailable pre-drain hook on a Machine and the update of the status of the corresponding CP or MD resource, leaving room for race conditions
  • (3) even setting aside race conditions, there are corner cases where the CP or MD controller would not update these statuses for a while (stopped, slow, restarting, renewing its lease, etc.)

With this refactoring, before selecting a Machine to be drained and deleted, the cluster-maxunavailable controller looks at all the Machines of the cluster and determines whether enough of them are "ok" (a sketch of this check follows the list below):

  • "enough of them" means more than "the total of Machines desired for the cluster, minus one" (total of Machines desired for the cluster is: total of spec.replicas over CP and all MDs)
  • "ok" means "NodeHealthy and not being actively deleting", with "actively deleting" meaning "not having the controller pre-drain hook"

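To make this concrete, here is a minimal Go sketch of the counting logic described above, assuming the CAPI v1beta1 API types. The hook annotation name (`cluster-maxunavailable` under the standard pre-drain hook prefix) and the helper names are illustrative assumptions, not the actual identifiers used by the controller.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// Illustrative annotation key: CAPI pre-drain hooks are annotations under the
// "pre-drain.delete.hook.machine.cluster.x-k8s.io" prefix; the exact hook name
// used by cluster-maxunavailable is an assumption here.
const clusterMaxUnavailableHook = clusterv1.PreDrainDeleteHookAnnotationPrefix + "/cluster-maxunavailable"

// machineIsOK returns true if the Machine has the NodeHealthy condition set to
// True and still carries the controller's pre-drain hook (i.e. it is not being
// actively deleted under our control).
func machineIsOK(m clusterv1.Machine) bool {
	nodeHealthy := false
	for _, c := range m.Status.Conditions {
		if c.Type == clusterv1.MachineNodeHealthyCondition && c.Status == corev1.ConditionTrue {
			nodeHealthy = true
			break
		}
	}
	_, hookStillPresent := m.Annotations[clusterMaxUnavailableHook]
	return nodeHealthy && hookStillPresent
}

// enoughMachinesOK counts "ok" Machines and checks that the count exceeds the
// desired total (sum of spec.replicas over the CP and all MDs) minus one.
func enoughMachinesOK(machines []clusterv1.Machine, desiredTotal int) bool {
	ok := 0
	for _, m := range machines {
		if machineIsOK(m) {
			ok++
		}
	}
	return ok > desiredTotal-1
}

func main() {
	// Toy example: 3 desired Machines, all healthy and still holding the hook.
	machines := make([]clusterv1.Machine, 3)
	for i := range machines {
		machines[i].Annotations = map[string]string{clusterMaxUnavailableHook: ""}
		machines[i].Status.Conditions = clusterv1.Conditions{{
			Type:   clusterv1.MachineNodeHealthyCondition,
			Status: corev1.ConditionTrue,
		}}
	}
	fmt.Println(enoughMachinesOK(machines, 3)) // true: 3 ok > 3-1
}
```

Only when this check passes does the controller remove its pre-drain hook from the selected Machine, which is what keeps the count of "ok" Machines under its sole control.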
This new approach is robust because the criterion used to conclude that enough Machines are ok does not depend on external factors: only our controller removes the cluster-maxunavailable pre-drain hook, and it will not do so unless enough Machines are "ok".

The initial implementation suffered from a lack of automated testing, so as part of this work I introduced a way to check that this controller does its job reliably: the sylva-core CI now includes a job relying on Prometheus metrics for CAPI resources and a Prometheus/Thanos alert checking that there are always Machines that are "NodeHealthy and not being drained". This was done in sylva-projects/sylva-core!6523, where the present refactoring was tested, with good results.

In this MR I'm also removing code that was intended to avoid race conditions (enforcing a 3-minute minimum delay between two actions on a given cluster). This code was most probably hiding the fact that CP/MD controllers do not necessarily maintain status.(available|ready)Replicas fields that reliably reflect the fact that a Machine is being drained, which resulted in not seeing the design problem in local manual tests where the drain time was shorter than the 3-minute delay.

A side effect of this refactoring is that the controller no longer needs to make any assumptions about what is in the status of XXXControlPlane resources, about the semantics behind status.(available|ready)Replicas, or about when/how they are updated. Any discrepancy between CAPI bootstrap/CP providers on those aspects will not challenge the correctness of the cluster-maxunavailable functionality.

This MR was tested in sylva-projects/sylva-core!6523.

💡 Reviewer hint: this MR is meant to be reviewed one commit at a time 💡
