Node drain blocked with pods stuck in Terminating state during RKE2 rolling updates (etcd member removed from cluster)
Summary
This issue has been observed in the upgrade-management-cluster job of the capm3-ha-rke2-virt-ubuntu nightly test: https://gitlab.com/sylva-projects/sylva-core/-/jobs/7747710665 (it also apparently appears on capo-rke2-ubuntu, see https://gitlab.com/sylva-projects/sylva-core/-/jobs/7744506073).
Many pods are stuck in the Terminating state on node mgmt-1440551165-rke2-capm3-virt-management-cp-1 (all the pods that were being drained).
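For reference, a quick way to list the stuck pods (a minimal sketch; the node name is the one from the job above, and the exact output depends on the kubectl version):

```shell
# List all pods scheduled on the affected node; drained pods show up as Terminating
kubectl get pods -A -o wide \
  --field-selector spec.nodeName=mgmt-1440551165-rke2-capm3-virt-management-cp-1 \
  | grep Terminating
```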
If we look at the node, we can see that the kubelet has stopped reporting its status:
```yaml
spec:
  podCIDR: 100.72.3.0/24
  podCIDRs:
  - 100.72.3.0/24
  providerID: metal3://sylva-system/mgmt-1440551165-rke2-capm3-virt-management-cp-1/mgmt-1440551165-rke2-capm3-virt-cp-af8bd00850-x5v6f
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2024-09-04T21:52:01Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2024-09-04T21:52:09Z"
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2024-09-04T21:54:25Z"
  unschedulable: true
status:
  [...]
  conditions:
  - lastHeartbeatTime: "2024-09-04T21:49:39Z" # <<< last heartbeat
    lastTransitionTime: "2024-09-04T21:52:01Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready
```
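(The spec/status above can be retrieved with something like the following; again the node name is taken from this job:)

```shell
# Dump the full node object, or just its Ready condition
kubectl get node mgmt-1440551165-rke2-capm3-virt-management-cp-1 -o yaml
kubectl get node mgmt-1440551165-rke2-capm3-virt-management-cp-1 \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
```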
Looking at the kubelet logs, we can see that it starts failing to reach the API server at 21:51:41:
```
E0904 21:51:41.639187 1995 kubelet_node_status.go:540] "Error updating node status, will retry" err="failed to patch status \"" for node \"mgmt-1440551165-rke2-capm3-virt-management-cp-1\": Patch \"https://127.0.0.1:6443/api/v1/nodes/mgmt-1440551165-rke2-capm3-virt-management-cp-1/status?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
```
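(On an RKE2 node the kubelet writes its log to RKE2's data directory rather than journald; the path below is an assumption based on RKE2 defaults, not customized in Sylva as far as I know:)

```shell
# Look for the node-status update errors directly on the node
grep "Error updating node status" /var/lib/rancher/rke2/agent/logs/kubelet.log
```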
This can be explained by the fact that the api-server is crash-looping, failing to reach etcd:
```
2024-09-04T22:50:18.161588374Z stderr F W0904 22:50:18.161465 1 logging.go:59] [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:18.171177365Z stderr F W0904 22:50:18.171091 1 logging.go:59] [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:19.67234822Z stderr F W0904 22:50:19.672219 1 logging.go:59] [core] [Channel #1 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:19.783642922Z stderr F W0904 22:50:19.783530 1 logging.go:59] [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:19.953395062Z stderr F W0904 22:50:19.953280 1 logging.go:59] [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
```
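(These lines are in the CRI log format, so they can be read directly from the pod log files on the node, or via crictl pointed at RKE2's containerd socket; the paths below are assumptions based on RKE2 defaults:)

```shell
# Raw CRI log files of the api-server static pod
ls /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/
# Or through crictl, using RKE2's containerd socket
crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps -a --name kube-apiserver
crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock logs <container-id>
```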
etcd is unavailable because the local etcd member has been removed from the cluster by the CAPI controller (as described in #1420 (closed)).
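(This can be cross-checked from a surviving control-plane node by listing the remaining etcd members; the certificate paths below are assumptions based on RKE2's default layout, and etcdctl is assumed to be available on the node:)

```shell
# The removed member should no longer appear in the list
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list -w table
```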
With kubeadm we don't hit the same issue, because the kubelet uses the VIP to reach the api-server.
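(The difference is visible in the kubelet kubeconfig on each flavour of node; the paths and the VIP placeholder below are assumptions based on the respective defaults:)

```shell
# RKE2: kubelet talks to the local api-server, which breaks once the local etcd member is gone
grep 'server:' /var/lib/rancher/rke2/agent/kubelet.kubeconfig
#   server: https://127.0.0.1:6443
# kubeadm: kubelet talks to the control-plane endpoint (VIP), so it survives the loss of the local api-server
grep 'server:' /etc/kubernetes/kubelet.conf
#   server: https://<control-plane-VIP>:6443
```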
This issue may previously have been hidden by the drainTimeout that was set on nodes: sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster!421 (merged)
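(For context, the corresponding knob in Cluster API is the nodeDrainTimeout field of the Machine spec; a quick way to check what is currently applied, assuming the cluster resources live in the sylva-system namespace as in the providerID above:)

```shell
# An empty/unset value means the drain can block indefinitely
kubectl get machines -n sylva-system \
  -o custom-columns='NAME:.metadata.name,DRAIN_TIMEOUT:.spec.nodeDrainTimeout'
```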