need to ensure that MetalLB speakers stop advertising on node drain
Context
This issue is partly speculative.
My starting point is that I'm trying to find the cause of cases where the cluster VIP becomes unreachable during node rolling updates with recent versions of the RKE2 control plane provider.
For example, this capm3 update-workload-cluster job: https://gitlab.com/sylva-projects/sylva-core/-/jobs/8379160745 shows plenty of issues related to accessing the workload cluster VIP, and there we can see that the VIP seems to be served by the cp-0 node which had just been torn down (see ServiceL2Statuses.metallb.io.summary.txt).
Hypothetical cause
Today our MetalLB speaker pods are not evicted on node drain (they are part of a DaemonSet, which drain skips, and which gives them tolerations to stay on cordoned nodes; see the indicative excerpt below).
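For reference, this is roughly what the speaker DaemonSet tolerations look like; the exact values depend on the MetalLB version and Helm values in use, so treat this as an indicative sketch rather than a copy of our deployment:

```yaml
# Indicative excerpt of the MetalLB speaker DaemonSet (not copied from our deployment)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: speaker
  namespace: metallb-system
spec:
  template:
    spec:
      tolerations:
        # keeps the speaker on control plane nodes
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        # DaemonSet pods additionally get this toleration automatically,
        # so they stay scheduled on a cordoned node during drain
        - key: node.kubernetes.io/unschedulable
          operator: Exists
          effect: NoSchedule
```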
This is possibly a problem because in L2 mode, MetalLB "leader election" (documented here) requires that all speaker pods have the same information from the k8s API, an assumption that does not seem to always hold during a node rolling update.
For instance, the CAPI RKE2 control plane provider (cabpr) has recently evolved: on Node teardown (e.g. during a CP rolling update), it removes the etcd membership of the removed CP node (right after drain).
I'm hypothesizing that after this, the following occurs:
- the local API server (now cut off from etcd) can't provide up-to-date information on the cluster's API resources
- same for kube-proxy (which reaches the API server via 127.0.0.1:6443)
- hence Kubernetes Services will start malfunctioning on that node (the list of endpoints for a Service may include stale endpoints and may lack new ones)
- since MetalLB relies on the kubernetes Service to access the k8s API and know about all speakers, MetalLB may also have unreliable access to up-to-date information (see the sketch after this list): it might round-robin and sometimes hit a working API server, but it may also hit its own broken API server, and it may take a while to time out / retry, etc.
- ... as said above, this may lead the MetalLB speakers to take a wrong decision: if the speaker on the local node was holding the VIP, it might keep holding it (while not being able to actually serve anything)
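To illustrate the kind of stale view this hypothesis assumes, here is a hand-written sketch of what the endpoints of the default/kubernetes Service could look like when read through the cut-off API server (addresses are made up; in a healthy view, the removed control plane address would already be gone):

```yaml
# Hypothetical stale view of the kubernetes Service endpoints on the affected node:
# the just-removed control plane node is still listed as a valid endpoint.
apiVersion: v1
kind: Endpoints
metadata:
  name: kubernetes
  namespace: default
subsets:
  - addresses:
      - ip: 192.168.10.10   # cp-0, already torn down, should not be listed anymore
      - ip: 192.168.10.11   # cp-1
      - ip: 192.168.10.12   # cp-2
    ports:
      - name: https
        port: 6443
        protocol: TCP
```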
Kubeadm also removes the etcd membership of a removed node, but then things differ (details to dig into and confirm):
- MetalLB isn't used, kube-vip is used instead, and kube-vip relies on more traditional k8s API leases, which presumably prevent a disconnected node from keeping the VIP
- with kubeadm, kube-proxy points to the VIP rather than to the local 127.0.0.1:6443, so it would keep reaching the cluster even if the local API server is NOK (see the kubeconfig sketch below)
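For illustration, the difference boils down to the server field in kube-proxy's kubeconfig; the file paths and addresses below are indicative, not taken from an actual cluster:

```yaml
# RKE2: kube-proxy talks to the API server on its own node
# (e.g. /var/lib/rancher/rke2/agent/kubeproxy.kubeconfig)
clusters:
  - name: local
    cluster:
      server: https://127.0.0.1:6443
---
# kubeadm: kube-proxy talks to the control plane endpoint, typically the VIP
# (kubeconfig shipped in the kube-proxy ConfigMap)
clusters:
  - name: default
    cluster:
      server: https://<control-plane-vip>:6443
```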
Related points
This issue could be partially hidden by the fact that the Machine hosting the misbehaving MetalLB speaker (the one from a recently torn down Node erroneously holding the VIP) should quickly be deleted by CAPI... BUT this does not happen, because in the CAPI Machine controller the InfraMachine deletion step comes after the observation that drain is done, and that observation can't be made if the Node resource can't be read... which (at least for a workload cluster) happens via the VIP.
What to do with MetalLB during drains?
Generally speaking, the leader election change will be smoother if the MetalLB speaker pod is torn down cleanly during node drain.
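Another option to explore (to be confirmed for the MetalLB version we deploy): MetalLB is documented to exclude nodes carrying the standard node.kubernetes.io/exclude-from-external-load-balancers label from L2 announcements, so applying that label to the node before/at drain time might make its speaker stop advertising without having to evict the DaemonSet pod. A minimal sketch, assuming that behavior is available:

```yaml
# Hypothetical: label the node being drained so that MetalLB stops selecting
# it for L2 announcements (assumes the deployed MetalLB version honors this
# well-known label; to be confirmed)
apiVersion: v1
kind: Node
metadata:
  name: workload-cluster-cp-0   # hypothetical node name
  labels:
    node.kubernetes.io/exclude-from-external-load-balancers: ""
```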