RKE2 agent loadbalancer proxy errors during node rolling updates
In my dev environment, when triggering node rolling updates, I frequently observe cases where the RKE2 agent loses its connection to the control plane nodes.
The typical log entries are:
root@management-cluster-md-md0-1ba074d313-qljfm:/var/lib/rancher/rke2/agent/etc# journalctl -xeu rke2-agent | tail
Jun 14 12:37:02 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:02Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:04 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:04Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
Jun 14 12:37:04 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:04Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:05 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:05Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:07 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:07Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
Jun 14 12:37:07 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:07Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:08 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:08Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:10 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:10Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
Jun 14 12:37:10 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:10Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:11 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:11Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
In short, the same three messages repeat every few seconds:
"Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
"Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
"Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
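To get a quick picture of how often this happens and against which address, the log lines can be tallied with a small script. This is only a rough sketch, assuming the log format shown above (key="value" pairs plus an unquoted `level=` field); the function names and the script name are made up for illustration and are not part of RKE2.

```python
import re
import sys
from collections import Counter

# Matches key="value" pairs as they appear in the rke2-agent log lines above
# (time="...", msg="...", error="...", url="...").
PAIR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_fields(line):
    """Return the quoted key="value" pairs found in one journal line."""
    return dict(PAIR_RE.findall(line))


def count_errors(lines):
    """Count error-level entries, grouped by (msg, error) pair."""
    counts = Counter()
    for line in lines:
        # level is not quoted in the log output, so check it as a substring.
        if "level=error" not in line:
            continue
        fields = parse_fields(line)
        counts[(fields.get("msg", ""), fields.get("error", ""))] += 1
    return counts


if __name__ == "__main__":
    # Reads journal output from stdin and prints one line per distinct error.
    for (msg, err), n in count_errors(sys.stdin).most_common():
        print(f"{n:6d}  {msg}  ({err})")
```

It can be fed from the journal, for example `journalctl -u rke2-agent --no-pager | python3 count_rke2_errors.py` (the script name is hypothetical).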
I've filed an upstream issue here with many more details.
EDIT: an upstream issue already existed: https://github.com/rancher/rke2/issues/5949. In fact, what seemed to be a duplicate perhaps isn't one: the fix that was released in RKE2 1.28.11-rc3 does not solve our issue.
Edited by Thomas Morin