RKE2 - API reachability issue observed during node rolling upgrade
This issue was observed on https://gitlab.com/sylva-projects/sylva-core/-/pipelines/1540847925, in the context of "Update CAPI to v1.8.3, CAPBPR RKE2 bootstrap pr..." (!2961, merged).
The key symptom is the CAPI controllers logging the following errors and not doing any work anymore (resulting in Machines no longer being reconciled):
...
E1113 19:53:26.977255 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 19:54:29.990153 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 19:55:33.236863 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 19:56:37.626468 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 19:57:41.264219 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 19:58:44.056414 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 19:59:47.931469 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:00:51.711834 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:01:53.935922 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:02:56.326500 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:09:50.417677 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:11:40.610987 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:19:48.976346 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:27:58.303822 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:36:06.274253 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:44:13.772934 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capi)
E1113 20:52:22.913680 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capi)
The same logs appear on both controllers, and neither of them succeeds in becoming leader.
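For context on why the controllers stop doing any work: CAPI managers run with controller-runtime leader election against the `controller-leader-election-capi` Lease in `capi-system`, and once that lease can no longer be retrieved or renewed before the renew deadline, the manager gives up leadership and exits rather than keep reconciling with a possibly stale view. A minimal sketch of that wiring, with illustrative timings that are assumptions and not the exact values used by the CAPI deployment:

```go
package main

import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// Illustrative timings only (assumption); the real CAPI deployment may
	// use different values.
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "controller-leader-election-capi",
		LeaderElectionNamespace: "capi-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		os.Exit(1)
	}

	// If the apiserver cannot serve the Lease GET/UPDATE before RenewDeadline
	// expires, the election loop logs "failed to renew lease" and Start()
	// returns "leader election lost": no reconciliation happens after that.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

This matches the "failed to renew lease ... leader election lost" sequence visible in the "previous" logs below.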
We have the "previous" (pre-restart) container logs for one of the controllers:
19:52:22.509097 1 machine_controller.go:388] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane-zrdh2" namespace="sylva-system" name="mgmt-1540847
19:52:22.569912 1 machine_controller.go:711] "WARNING: ignoring DaemonSet-managed Pods: calico-system/calico-node-2p7zd, cattle-monitoring-system/rancher-monitoring-prometheus-node-exporter-ckk2q, kube-system/rke2-ingress-nginx-controller-fzvsm, longhorn-syste
19:52:22.569953 1 machine_controller.go:742] "Drain successful" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane-zrdh2" namespace="sylva-system" name="mgmt-1540
19:52:22.570093 1 recorder.go:104] "success draining Machine's node \"mgmt-1540847925-rke2-capm3-virt-management-cp-0\"" logger="events" type="Normal" object={"kind":"Machine","namespace":"sylva-system","name":"mgmt-1540847925-rke2-capm3-virt-control-plane-zrd
19:52:22.570234 1 machine_controller.go:445] "Waiting for pre-terminate hooks to succeed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane-zrdh2" namespace="syl
19:52:22.570285 1 recorder.go:104] "success waiting for node volumes detaching Machine's node \"mgmt-1540847925-rke2-capm3-virt-management-cp-0\"" logger="events" type="Normal" object={"kind":"Machine","namespace":"sylva-system","name":"mgmt-1540847925-rke2-ca
19:52:34.652603 1 leaderelection.go:340] Failed to update lock optimitically: Put "https://100.73.0.1:443/apis/coordination.k8s.io/v1/namespaces/capi-system/leases/controller-leader-election-capi": context deadline exceeded, falling back to slow path
19:52:34.652700 1 leaderelection.go:347] error retrieving resource lock capi-system/controller-leader-election-capi: client rate limiter Wait returned an error: context deadline exceeded
19:52:34.652767 1 leaderelection.go:285] failed to renew lease capi-system/controller-leader-election-capi: timed out waiting for the condition
19:52:34.652873 1 main.go:373] "Problem running manager" err="leader election lost" logger="setup"
These events are very precisely correlated with the etcd membership removal of cp-0 at 19:52:23, which is directly followed by Kubernetes API errors when the controller tries to retrieve its lease:
19:52:23.644161 1 workload_cluster_etcd.go:153] "Removed member: mgmt-1540847925-rke2-capm3-virt-management-cp-0-0b142fb0" controller="rke2controlplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="RKE2ControlPlane" RKE2ControlPlane="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane" namespace="sylva-system" name="mgmt-1540847925-rke2-capm3-virt-control-plane" reconcileID="bd396b4b-84f7-47ed-9fb7-04d275bb437f" Machine="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane-zrdh2"
19:52:23.644457 1 scale.go:188] "Removing pre-terminate hook from control plane Machine" controller="rke2controlplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="RKE2ControlPlane" RKE2ControlPlane="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane" namespace="sylva-system" name="mgmt-1540847925-rke2-capm3-virt-control-plane" reconcileID="bd396b4b-84f7-47ed-9fb7-04d275bb437f" Machine="sylva-system/mgmt-1540847925-rke2-capm3-virt-control-plane-zrdh2"
19:52:23.715139 1 leaderelection.go:332] error retrieving resource lock rke2-control-plane-system/rke2-controlplane-manager-leader-election-capi: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:55436->127.0.0.1:2379: read: connection reset by peer
19:52:33.481784 1 leaderelection.go:332] error retrieving resource lock rke2-control-plane-system/rke2-controlplane-manager-leader-election-capi: Get "https://100.73.0.1:443/apis/coordination.k8s.io/v1/namespaces/rke2-control-plane-system/leases/rke2-controlplane-manager-leader-election-capi": context deadline exceeded
19:52:33.482002 1 leaderelection.go:285] failed to renew lease rke2-control-plane-system/rke2-controlplane-manager-leader-election-capi: timed out waiting for the condition
19:52:33.482195 1 main.go:200] "problem running manager" err="leader election lost" logger="setup"
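To double-check what the etcd cluster looks like right after such a removal, the membership can be listed from one of the remaining CP nodes. A minimal sketch using the etcd v3 Go client, assuming RKE2's usual client certificate paths under /var/lib/rancher/rke2/server/tls/etcd (adjust if your installation differs):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/client/pkg/v3/transport"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumption: RKE2's usual etcd client certificate paths on a server node.
	tlsInfo := transport.TLSInfo{
		CertFile:      "/var/lib/rancher/rke2/server/tls/etcd/server-client.crt",
		KeyFile:       "/var/lib/rancher/rke2/server/tls/etcd/server-client.key",
		TrustedCAFile: "/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		panic(err)
	}

	// The rke2 supervisor talks to etcd on 127.0.0.1:2379 (see the
	// "connection reset by peer" log above), so we do the same.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range resp.Members {
		fmt.Printf("id=%x name=%s peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}
}
```

During the incident window one would expect only cp-1 and cp-2 to be listed, cp-0 having just been removed.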
So the chronology is:
- 2 CP nodes (cp-1 and cp-2) have been replaced during the rolling update; they are running v1.29.9+rke2r1
- 1 old node (cp-0, still on v1.28.8+rke2r1) remains, and it has been drained
- at 19:52:23 the etcd cluster membership for cp-0 is removed by the RKE2 control plane provider
- errors around Kubernetes API leases start right after that (see the lease-polling sketch below)
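The lease errors can also be observed from outside the controllers. The following client-go sketch (kubeconfig path and polling interval are assumptions) polls the CAPI leader-election Lease and makes it visible whether the apiserver stops answering, as in the "unable to return a response in the time allotted" errors above, or whether the holder simply stops renewing:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: RKE2's kubeconfig path on a server node; any admin
	// kubeconfig for the management cluster would do.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/rke2/rke2.yaml")
	if err != nil {
		panic(err)
	}
	// Short client-side timeout so a stuck apiserver surfaces as an error
	// instead of hanging the poll loop.
	cfg.Timeout = 5 * time.Second
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		lease, err := client.CoordinationV1().Leases("capi-system").
			Get(context.Background(), "controller-leader-election-capi", metav1.GetOptions{})
		switch {
		case err != nil:
			// During the incident window this is what the controllers see too.
			fmt.Println("lease GET failed:", err)
		case lease.Spec.HolderIdentity != nil && lease.Spec.RenewTime != nil:
			fmt.Printf("holder=%s renewTime=%s\n",
				*lease.Spec.HolderIdentity, lease.Spec.RenewTime.Format(time.RFC3339))
		}
		time.Sleep(5 * time.Second)
	}
}
```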
In the node log dumps, I observe the following in the rke2 logs on the two newer CP nodes:
...
Nov 13 19:52:37 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:52:37Z" level=info msg="Stopped tunnel to 192.168.100.20:9345"
Nov 13 19:52:37 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:52:37Z" level=info msg="Proxy done" err="context canceled" url="wss://192.168.100.20:9345/v1-rke2/connect"
Nov 13 19:53:06 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:06Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:37648: dial tcp 100.72.91.217:9443: connect: connection timed out"
Nov 13 19:53:07 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:07Z" level=info msg="Starting managed etcd apiserver addresses controller"
Nov 13 19:53:07 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:07Z" level=info msg="Starting managed etcd member removal controller"
Nov 13 19:53:07 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:07Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"
Nov 13 19:53:07 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:07Z" level=info msg="Starting k3s.cattle.io/v1, Kind=ETCDSnapshotFile controller"
Nov 13 19:53:07 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:07Z" level=info msg="Reconciling snapshot ConfigMap data"
Nov 13 19:53:08 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:08Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:37658: dial tcp 100.72.91.217:9443: connect: connection timed out"
Nov 13 19:53:27 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T19:53:27Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Nov 13 20:13:11 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:13:11Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:57434: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:13:13 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:13:13Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:57438: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:13:16 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:13:16Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:52014: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:13:22 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:13:22Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:52016: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:13:26 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:13:26Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:54132: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:13:31 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:13:31Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:54166: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:14:01 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:14:01Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:38824: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:14:07 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:14:07Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:56846: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:14:10 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:14:10Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:56854: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:14:12 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:14:12Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:56874: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
...
Nov 13 20:22:46 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:22:46Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:43194: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:23:23 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:23:23Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:47636: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:23:34 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:23:34Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:38384: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:23:35 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:23:35Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:43792: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:23:36 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:23:36Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:43808: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:24:04 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:24:04Z" level=warning msg="Proxy error: write failed: write tcp 127.0.0.1:9345->127.0.0.1:53182: write: broken pipe"
Nov 13 20:24:06 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:24:06Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:45780: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:24:12 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:24:12Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:45796: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:24:44 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:24:44Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:45994: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:24:47 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:24:47Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:37708: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
Nov 13 20:24:53 mgmt-1540847925-rke2-capm3-virt-management-cp-1 rke2[2010]: time="2024-11-13T20:24:53Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:37718: failed to find Session for client mgmt-1540847925-rke2-capm3-virt-management-cp-0"
...
This issue is very similar to #1687 (closed), keeping in mind that:
- many issues have been candidate explanations for #1687 (closed), such as:
  - upstream issues regarding etcd cluster membership removal on scale-down
  - cases where Calico Helm charts were removed by the RKE2 HelmChart controller
- until those issues were solved (i.e. now), we were hoping no issue would remain
This issue is a follow-up with fresh data, now that the initially suspected causes have been solved.