Rolling upgrade of MD failing with "Node password rejected, duplicate hostname" error

This issue has been observed on various CI runs, for example: https://gitlab.com/sylva-projects/sylva-core/-/jobs/7290773682

The node is stuck in cloud-init, while running the runcmd script (/var/lib/cloud/instance/scripts/runcmd):

rke2-capm3-virt-management-md-0 systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 sh[1267]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 sh[1268]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 kernel: [   59.842859] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 kernel: [   59.844141] Bridge firewalling registered
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=warning msg="cis-1.23 profile is deprecated and will be removed in v1.29. Please use cis instead."
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=info msg="Applying Pod Security Admission Configuration"
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=warning msg="cis-1.23 profile is deprecated and will be removed in v1.29. Please use cis instead."
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=info msg="Starting rke2 agent v1.28.8+rke2r1 (42cab2f61939504cb17073e47deaea0b29fe2c1b)"
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 192.168.100.2:9345"
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=info msg="Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [192.168.100.2:9345] [default: 192.168.100.2:9345]"
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 192.168.100.2:6443"
Jul  8 22:24:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:20Z" level=info msg="Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [192.168.100.2:6443] [default: 192.168.100.2:6443]"
Jul  8 22:24:21 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:21Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:24:31 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:31Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:24:41 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:41Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:24:52 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:24:52Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:02 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:02Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:12 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:12Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:20 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:28 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:28Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:34 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:34Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:43 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:43Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:25:51 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:25:51Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:01 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:01Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:09 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:09Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:19 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:19Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:24 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:24Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:35 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:35Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:45 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:45Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:52 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:52Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
Jul  8 22:26:58 mgmt-1365192642-rke2-capm3-virt-management-md-0 rke2[1273]: time="2024-07-08T22:26:58Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
...

The error message:

Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag

seems related to the fact that we set nodeReuse, and therefore reuse the same hostname (the BareMetalHost name) for our K8S nodes.

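This k3s issue (https://github.com/k3s-io/k3s/issues/802) and the k3s documentation (https://docs.k3s.io/architecture#how-agent-node-registration-works) indicate that the server keeps each agent's node password in a K8S secret whose name is prefixed by the node name, and that any later registration attempt under the same node name must present the same password. The node-password secrets on this cluster: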

kube-system                                                       mgmt-1365192642-rke2-capm3-virt-management-cp-0.node-password.rke2        Opaque                                        1      28m
kube-system                                                       mgmt-1365192642-rke2-capm3-virt-management-cp-1.node-password.rke2        Opaque                                        1      125m
kube-system                                                       mgmt-1365192642-rke2-capm3-virt-management-cp-2.node-password.rke2        Opaque                                        1      43m
kube-system                                                       mgmt-1365192642-rke2-capm3-virt-management-md-0.node-password.rke2        Opaque                                        1      56m

The md-0 Machine is still in Provisioning:

sylva-system      mgmt-1365192642-rke2-capm3-virt-md0-9gh4h-pnghk       mgmt-1365192642-rke2-capm3-virt       Provisioning   57m    v1.28.8+rke2r1
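
For triage, the node names that already have a password secret can be extracted from the secret listing above. A minimal Python sketch, assuming the input is plain-text output with NAMESPACE, NAME, TYPE, DATA and AGE columns, and that the `.node-password.rke2` suffix seen above is the naming convention. The function name is hypothetical.

```python
SUFFIX = ".node-password.rke2"  # suffix observed in the secret listing above


def nodes_with_password_secret(listing: str) -> dict[str, str]:
    """Map node name -> secret age for each node-password secret in the listing."""
    found = {}
    for line in listing.splitlines():
        cols = line.split()
        # Expected columns: NAMESPACE NAME TYPE DATA AGE
        if len(cols) >= 5 and cols[0] == "kube-system" and cols[1].endswith(SUFFIX):
            node_name = cols[1][: -len(SUFFIX)]
            found[node_name] = cols[4]
    return found
```

A secret that exists for a node name whose Machine is still Provisioning is a hint that the new machine is registering under a name the server already knows.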

In our case, the secret seems to have been recreated.
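
To make the suspected failure mode concrete, here is a toy Python model of the registration check described in the k3s documentation linked above. It is a sketch under assumptions, not rke2's code: the class, its methods, the hashing scheme (salted PBKDF2) and the short node name `md-0` are all hypothetical stand-ins.

```python
import hashlib
import hmac
import secrets


class NodePasswordStore:
    """Toy stand-in for the server-side node-password secrets (not rke2 code)."""

    def __init__(self):
        self._hashes = {}  # node name -> (salt, password hash)

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        # Assumption: any salted hash is enough for this illustration.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 10_000)

    def register(self, node_name: str, password: str) -> int:
        """Return an HTTP-like status: 200 if accepted, 403 on hash mismatch."""
        if node_name not in self._hashes:
            # First registration under this name: remember the password hash.
            salt = secrets.token_bytes(16)
            self._hashes[node_name] = (salt, self._hash(password, salt))
            return 200
        salt, stored = self._hashes[node_name]
        matches = hmac.compare_digest(stored, self._hash(password, salt))
        return 200 if matches else 403

    def forget(self, node_name: str) -> None:
        """Drop the stored entry for a node name, if any."""
        self._hashes.pop(node_name, None)


store = NodePasswordStore()
old_password = secrets.token_hex(16)  # generated on the original machine
new_password = secrets.token_hex(16)  # generated on the reprovisioned machine

first = store.register("md-0", old_password)   # original node joins
retry = store.register("md-0", old_password)   # same node, same password
reused = store.register("md-0", new_password)  # new machine, same hostname
```

In this model `first` and `retry` are accepted while `reused` is rejected with 403, which mirrors the "hash does not match" responses. Calling `store.forget("md-0")` before the new machine registers would let it through; in the model this stands for the stale entry being removed on the server side.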

Logs from cp-1 (syslog):

Jul  8 23:15:51 mgmt-1365192642-rke2-capm3-virt-management-cp-1 rke2[1272]: time="2024-07-08T23:15:51Z" level=error msg="Sending HTTP 403 response to 192.168.100.21:38064: unable to verify password for node mgmt-1365192642-rke2-capm3-virt-management-md-0: hash does not match"
Jul  8 23:15:58 mgmt-1365192642-rke2-capm3-virt-management-cp-1 rke2[1272]: time="2024-07-08T23:15:58Z" level=error msg="Sending HTTP 403 response to 192.168.100.21:65257: unable to verify password for node mgmt-1365192642-rke2-capm3-virt-management-md-0: hash does not match"
Jul  8 23:16:17 mgmt-1365192642-rke2-capm3-virt-management-cp-1 rke2[1272]: time="2024-07-08T23:16:17Z" level=error msg="Sending HTTP 403 response to 192.168.100.21:45821: unable to verify password for node mgmt-1365192642-rke2-capm3-virt-management-md-0: hash does not match"
Jul  8 23:16:18 mgmt-1365192642-rke2-capm3-virt-management-cp-1 systemd[1]: run-containerd-runc-k8s.io-a5b6e432445cebe4c5832ca5dd278856d72f419723dc346f4d5bb2da1eebfae8-runc.icpnOb.mount: Deactivated successfully.
Jul  8 23:16:33 mgmt-1365192642-rke2-capm3-virt-management-cp-1 rke2[1272]: time="2024-07-08T23:16:33Z" level=error msg="Sending HTTP 403 response to 192.168.100.21:53479: unable to verify password for node mgmt-1365192642-rke2-capm3-virt-management-md-0: hash does not match"
Jul  8 23:16:51 mgmt-1365192642-rke2-capm3-virt-management-cp-1 systemd[1]: run-containerd-runc-k8s.io-fdd790c39c8664400b94b0552ee6e6fd0ce82382516045e9283e02cc5a28c685-runc.BLIoKD.mount: Deactivated successfully.
Jul  8 23:17:13 mgmt-1365192642-rke2-capm3-virt-management-cp-1 rke2[1272]: time="2024-07-08T23:17:13Z" level=error msg="Sending HTTP 403 response to 192.168.100.21:42340: unable to verify password for node mgmt-1365192642-rke2-capm3-virt-management-md-0: hash does not match"
Jul  8 23:17:18 mgmt-1365192642-rke2-capm3-virt-management-cp-1 systemd[1]: run-containerd-runc-k8s.io-a5b6e432445cebe4c5832ca5dd278856d72f419723dc346f4d5bb2da1eebfae8-runc.KAGnke.mount: Deactivated successfully.
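
The rejections in the control-plane syslog above can be counted per node with a small script. A minimal Python sketch, assuming the message format is exactly the one shown in these log lines; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the server-side rejection message seen in the cp-1 syslog above.
REJECT_RE = re.compile(
    r"Sending HTTP 403 response to (?P<ip>[\d.]+):\d+: "
    r"unable to verify password for node (?P<node>[^\s:]+): hash does not match"
)


def count_rejections(log_text: str) -> Counter:
    """Count password rejections per (node name, client IP) pair."""
    hits = Counter()
    for line in log_text.splitlines():
        match = REJECT_RE.search(line)
        if match:
            hits[(match["node"], match["ip"])] += 1
    return hits
```

Grouping by client IP as well as node name helps tell apart a single machine retrying in a loop from two machines competing for the same hostname.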
Edited Jul 09, 2024 by Remi Le Trocquer