Upgrade from 1.1.1: etcd/apiserver errors (RKE2 bootstrap provider 0.7.x)
My starting point was https://gitlab.com/sylva-projects/sylva-core/-/jobs/7915786264, a failure of the update-mgmt job in a pipeline testing upgrades from 1.1.1 to main on !2959 (merged), the MR testing the bump of CAPI to 1.8 and the bump of the RKE2 bootstrap provider to 0.7.x.
Artifacts have expired, but I'm sharing them here instead: artifact-7915786264.zip
I've seen this on other jobs of the same MR, but the issue does not seem to happen all the time (e.g. https://gitlab.com/sylva-projects/sylva-core/-/jobs/7939560738, https://gitlab.com/sylva-projects/sylva-core/-/jobs/7939561210 and https://gitlab.com/sylva-projects/sylva-core/-/pipelines/1472603361 were OK).
The run dissected below (https://gitlab.com/sylva-projects/sylva-core/-/jobs/7915786264) was with capm3, but we've had failures with the exact same symptoms with capo as well (e.g. https://gitlab.com/sylva-projects/sylva-core/-/jobs/7951588127).
The final symptom causing the job failure is that the deletion of the Metal3Machine is stuck, but the issue is not specific to this controller.
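For reference, the stuck deletion shows up as Metal3Machines that have a deletionTimestamp but still carry finalizers: with no controller able to reconcile them, nothing removes the finalizers. Below is a minimal sketch to list them, assuming a kubeconfig pointing at the management cluster; the use of the dynamic client is an illustrative choice, and the sylva-system namespace matches the Machine seen in the logs further down.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// List Metal3Machines that are marked for deletion but still carry
	// finalizers: with no controller holding the leader lease, nobody
	// removes the finalizers, so deletion never completes.
	cfg := ctrl.GetConfigOrDie()
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group:    "infrastructure.cluster.x-k8s.io",
		Version:  "v1beta1",
		Resource: "metal3machines",
	}
	list, err := dyn.Resource(gvr).Namespace("sylva-system").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, m := range list.Items {
		if m.GetDeletionTimestamp() != nil {
			fmt.Printf("%s: deletionTimestamp=%s finalizers=%v\n",
				m.GetName(), m.GetDeletionTimestamp(), m.GetFinalizers())
		}
	}
}
```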
The capm3 controller pod logs show that neither of the two controllers can become leader (for a long time, from 14:05 until 14:50, the time at which the logs were taken):
E0925 14:50:15.671134 1 leaderelection.go:332] error retrieving resource lock capm3-system/controller-leader-election-capm3: the server has received too many requests and has asked us to try again later (get leases.coordination.k8s.io controller-leader-election-capm3)
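For context: capm3, like other controller-runtime based controllers, acquires and keeps leadership by periodically renewing a coordination.k8s.io Lease; when every Lease request is rejected with errors like the one above, the manager never becomes leader and its reconcilers never run, which is enough to explain the stuck Metal3Machine deletion. A rough sketch of how such a manager is wired up, with illustrative tunables rather than the actual capm3 settings:

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Leader election is backed by a Lease object in the given namespace;
	// the manager must be able to GET/UPDATE it before LeaseDuration
	// expires, otherwise it keeps retrying and never starts its controllers.
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "controller-leader-election-capm3", // lease named in the error above
		LeaderElectionNamespace: "capm3-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Reconcilers registered on mgr only start once leadership is acquired.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```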
Diving into the logs shows that many/most controllers are impacted (on all nodes), with this error or other issues related to accessing the k8s API:
kyverno-admission-controller-6b7f579cdf-2m46s (on md0)
certificate issues starting at 14:05:55:
2024-09-25T14:05:45Z ERROR updaterequest-generator updaterequest/generator.go:87 failed to create UpdateRequest, retrying {"ruleType": "mutate", "resource": "cluster.x-k8s.io/v1beta1/Machine/sylva-system/mgmt-1468717013-rke2-capm3-virt-control-plane-z79tn", "name": "ur-", "namespace": "kyverno", "error": "rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:57234->127.0.0.1:2379: read: connection reset by peer"}
2024-09-25T14:05:51Z ERROR setup.runtime-checks runtime/utils.go:101 failed to validate certificates {"error": "Get \"https://100.73.0.1:443/api/v1/namespaces/kyverno/secrets/kyverno-svc.kyverno.svc.kyverno-tls-ca\": context canceled"}
2024-09-25T14:05:55Z ERROR DefaultContextLoaderFactory context/deferred.go:42 failed to load data {"name": "minavailable", "error": "failed to fetch data for APICall: failed to GET resource with raw url\n: /apis/policy/v1/namespaces/calico-system/poddisruptionbudgets: Get \"https://100.73.0.1:443/apis/policy/v1/namespaces/calico-system/poddisruptionbudgets\": context canceled"}
2024-09-25T14:06:01Z ERROR setup.runtime-checks runtime/utils.go:101 failed to validate certificates {"error": "Get \"https://100.73.0.1:443/api/v1/namespaces/kyverno/secrets/kyverno-svc.kyverno.svc.kyverno-tls-ca\": context canceled"}
2024-09-25T14:06:03Z INFO webhooks.server logging/log.go:184 2024/09/25 14:06:03 http: TLS handshake error from 100.72.92.192:37630: EOF
2024-09-25T14:06:11Z ERROR setup.runtime-checks runtime/utils.go:101 failed to validate certificates {"error": "Get \"https://100.73.0.1:443/api/v1/namespaces/kyverno/secrets/kyverno-svc.kyverno.svc.kyverno-tls-ca\": context canceled"}
2024-09-25T14:06:21Z ERROR setup.runtime-checks runtime/utils.go:101 failed to validate certificates {"error": "Get \"https://100.73.0.1:443/api/v1/namespaces/kyverno/secrets/kyverno-svc.kyverno.svc.kyverno-tls-ca\": context canceled"}
...
2024-09-25T14:50:38Z ERROR klog watch/watcher.go:115 Watch failed {"error": "the server has received too many requests and has asked us to try again later"}
...
These errors on this pod keep flowing until the end of the run:
2024-09-25T14:50:37Z ERROR setup.runtime-checks runtime/utils.go:101 failed to validate certificates {"error": "Get \"https://100.73.0.1:443/api/v1/namespaces/kyverno/secrets/kyverno-svc.kyverno.svc.kyverno-tls-ca\": context canceled"}
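The "too many requests and has asked us to try again later" message is how client-go surfaces HTTP 429 responses from the apiserver, which are typically produced by its max-inflight limits or by API Priority and Fairness when the server is overloaded (here more likely a consequence of etcd being unhealthy than a root cause). One way to confirm this, assuming credentials that can read the /metrics non-resource URL, is to look at the flowcontrol rejection counters; a sketch:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Scrape the apiserver's own /metrics endpoint and keep only the
	// API Priority and Fairness rejection counters; non-zero values
	// line up with the 429 "too many requests" responses seen in the logs.
	cfg := ctrl.GetConfigOrDie()
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	raw, err := clientset.CoreV1().RESTClient().Get().AbsPath("/metrics").DoRaw(context.Background())
	if err != nil {
		panic(err)
	}

	for _, line := range strings.Split(string(raw), "\n") {
		if strings.HasPrefix(line, "apiserver_flowcontrol_rejected_requests_total") {
			fmt.Println(line)
		}
	}
}
```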
Overview of the timeline:
- cp-1 has not been rolling-updated yet
- cp-0's rolling update apparently completed fine around 14:03-14:05
- cp-2 rebuild starts around 14:05:35
- draining goes fine
- the etcd member is removed at 14:05:45 (see the membership-check sketch after this list)
- most of the errors happen at around 14:06:53, but the cause could of course be earlier
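Since the etcd member removal at 14:05:45 is the last step before the error storm, it would help to see what the surviving etcd members think the cluster membership is at that point. Below is a minimal sketch using the etcd v3 client, meant to run on a remaining control-plane node; the endpoint and certificate paths are the usual RKE2 locations and are assumptions that may need adjusting.

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Paths below are the usual RKE2-managed etcd client cert locations
	// on a server node; adjust if your layout differs (assumption).
	caCert, err := os.ReadFile("/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caCert)

	cert, err := tls.LoadX509KeyPair(
		"/var/lib/rancher/rke2/server/tls/etcd/server-client.crt",
		"/var/lib/rancher/rke2/server/tls/etcd/server-client.key",
	)
	if err != nil {
		panic(err)
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
		TLS:         &tls.Config{RootCAs: pool, Certificates: []tls.Certificate{cert}},
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List members as seen by the local etcd: a member that was removed
	// at 14:05:45 but is still being dialed by its former peers would show
	// up as a mismatch between this list and the peers' logs.
	members, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range members.Members {
		fmt.Printf("member %x name=%s peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}
}
```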
The etcd server logs are full of errors like the following, starting at 14:06:57 and never stopping:
{"level":"warn","ts":"2024-09-25T14:06:57.558296Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"192.168.100.20:45774","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2024-09-25T14:06:57.558406Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"192.168.100.20:45790","server-name":"","error":"remote error: tls: bad certificate"}
We discussed this case with @feleouet, but have not reached a conclusion yet.