Failure draining node: CAPI failing to access CAPO and CABPR resources

https://gitlab.com/sylva-projects/sylva-core/-/jobs/7748962509

Failure draining node mgmt-1440817843-rke2-capo-cp-1867667a31-q6lgc, Machine mgmt-1440817843-rke2-capo-control-plane-mw52f:

E0905 04:24:33.953472       1 controller.go:329] "Reconciler error" err="unable to get node mgmt-1440817843-rke2-capo-cp-1867667a31-q6lgc: 
Get \"https://100.73.0.1:443/api/v1/nodes/mgmt-1440817843-rke2-capo-cp-1867667a31-q6lgc?timeout=10s\": 
net/http: request canceled (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/mgmt-1440817843-rke2-capo-control-plane-mw52f" namespace="sylva-system" name="mgmt-1440817843-rke2-capo-control-plane-mw52f" reconcileID="68e44ce7-4cba-4b23-befb-f57096b0f0eb"
E0905 04:24:33.961208       1 controller.go:329] "Reconciler error"
 err="failed to check if Kubernetes Node deletion is allowed: 
failed to retrieve RKE2ControlPlane external object \"sylva-system\"/\"mgmt-1440817843-rke2-capo-control-plane\": 
Internal error occurred: error resolving resource" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/mgmt-1440817843-rke2-capo-control-plane-mw52f" namespace="sylva-system" name="mgmt-1440817843-rke2-capo-control-plane-mw52f" reconcileID="05d3d818-a212-4f03-bd1c-12177b118f99"

and then it keeps looping on errors like:

E0905 05:14:09.397665       1 controller.go:329] "Reconciler error" 
err="failed to check if Kubernetes Node deletion is allowed: 
failed to retrieve RKE2ControlPlane external object \"sylva-system\"/\"mgmt-1440817843-rke2-capo-control-plane\": 
Internal error occurred: error resolving resource" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/mgmt-1440817843-rke2-capo-control-plane-mw52f" namespace="sylva-system" name="mgmt-1440817843-rke2-capo-control-plane-mw52f" reconcileID="76e4633c-8873-4f4c-b16d-ff56bdc18884"

Starting at approximately the same time (04:24), the apiserver logs are full of:

W0905 04:24:31.327823       1 cacher.go:172] Terminating all watchers from cacher rke2configs.bootstrap.cluster.x-k8s.io
E0905 04:24:31.333219       1 customresource_handler.go:301] 
unable to load root certificates: unable to parse bytes as PEM block
E0905 04:24:31.338250       1 customresource_handler.go:301] 
unable to load root certificates: unable to parse bytes as PEM block
W0905 04:24:31.362211       1 cacher.go:172] 
Terminating all watchers from cacher rke2configtemplates.bootstrap.cluster.x-k8s.io
E0905 04:24:31.367443       1 customresource_handler.go:301] 
unable to load root certificates: unable to parse bytes as PEM block
W0905 04:24:31.433130       1 cacher.go:172] 
Terminating all watchers from cacher rke2controlplanes.controlplane.cluster.x-k8s.io
E0905 04:24:31.439521       1 customresource_handler.go:301] 
unable to load root certificates: unable to parse bytes as PEM block
E0905 04:24:31.447041       1 customresource_handler.go:301] 
unable to load root certificates: unable to parse bytes as PEM block
E0905 04:24:31.453168       1 customresource_handler.go:301] 
unable to load root certificates: unable to parse bytes as PEM block

the corresponding code is here: https://github.com/kubernetes/apiextensions-apiserver/blob/kubernetes-1.28.12/pkg/apiserver/customresource_handler.go#L301

and it is what triggers the "error resolving resource" message that we see in the CAPI logs.

The PEM error probably comes from this call chain:

https://github.com/kubernetes/apiextensions-apiserver/blob/master/pkg/apiserver/customresource_handler.go#L690
> https://github.com/kubernetes/apiextensions-apiserver/blob/master/pkg/apiserver/conversion/converter.go#L69
> https://github.com/kubernetes/apiextensions-apiserver/blob/kubernetes-1.28.12/pkg/apiserver/conversion/webhook_converter.go#L102

https://github.com/kubernetes/apiextensions-apiserver/blob/kubernetes-1.28.12/pkg/apiserver/conversion/webhook_converter.go#L77

https://github.com/kubernetes/kubernetes/blob/v1.28.12/staging/src/k8s.io/apiserver/pkg/util/webhook/client.go#L120

cert-manager possibly updated the injected caBundle a few minutes before these errors appeared; see:

$ grep rke2controlplanes.controlplane.cluster.x-k8s.io cert-manager/cert-manager-cainjector-5d877d5b85-ff6lp/logs.txt
I0905 04:08:31.673457       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="rke2controlplanes.controlplane.cluster.x-k8s.io"
I0905 04:10:15.405086       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="rke2controlplanes.controlplane.cluster.x-k8s.io"
I0905 04:10:15.560066       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="rke2controlplanes.controlplane.cluster.x-k8s.io"
I0905 04:14:21.471563       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="rke2controlplanes.controlplane.cluster.x-k8s.io"
I0905 04:14:21.693425       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="rke2controlplanes.controlplane.cluster.x-k8s.io"

We lack CRD resource dumps in debug-on-exit to dig deeper (this has since been fixed by !2836 (merged)).

Following an idea from @feleouet, the broken caBundle is very plausibly explained as follows:

  • the CRD definitions produced by the CABPR and CAPO Kustomizations specify a placeholder caBundle (a single newline encoded in base64: caBundle: Cg==)
  • Flux rewrites the caBundle back to this placeholder value during each reconciliation
  • cert-manager then rewrites it with the real caBundle
  • ... in this CI job, it seems that cert-manager did not do that, or did not do it properly

I'll keep this "CI failure" MR open so that we can more easily identify new occurrences (a trace of "unable to parse bytes as PEM block" in the apiserver logs in CI artifacts).

And I'll file a separate issue to address the problematic Flux vs cert-manager interaction.

Edited by Thomas Morin