Unclear errors on kubeadm undersized nodes

Summary

A misconfigured cluster with undersized VMs was unhealthy, with many pods in crash loops, but the nodes reported no MemoryPressure condition.

Steps to reproduce

Deploy kubeadm-capo with the default flavor (m1.large: 4 vCPUs, 8 GB RAM)

What is the current bug behavior?

kswapd was consuming a large share of CPU, and only a small amount of RAM was available (~500M).
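One possible reason the nodes showed no MemoryPressure is that roughly 500M of available memory is still well above the kubelet's documented default hard-eviction threshold of `memory.available<100Mi`. The sketch below illustrates that comparison. It is only an approximation: the kubelet derives `memory.available` from cgroup statistics rather than from `MemAvailable` in `/proc/meminfo`, the threshold on this cluster may have been configured differently, and the sample text and function names are hypothetical.

```python
import re

# Assumption: the documented kubelet default hard-eviction threshold
# (memory.available<100Mi). The affected cluster may override it.
ASSUMED_EVICTION_THRESHOLD_MIB = 100


def mem_available_mib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo-style text, in MiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in input")
    return int(match.group(1)) / 1024


def would_signal_pressure(available_mib, threshold_mib=ASSUMED_EVICTION_THRESHOLD_MIB):
    """True if available memory is below the assumed eviction threshold."""
    return available_mib < threshold_mib


# Illustrative sample only (not taken from the affected node): about 500 MiB free.
SAMPLE_MEMINFO = "MemTotal:        8148000 kB\nMemAvailable:     512000 kB\n"

available = mem_available_mib(SAMPLE_MEMINFO)
print(f"{available:.0f} MiB available, below threshold: {would_signal_pressure(available)}")
```

On a real node, the same functions could be fed the contents of `/proc/meminfo` instead of the sample string.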

containerd was logging many timeout errors when executing health checks in containers:

Jun 27 16:12:17 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:17.178611532Z" level=error msg="ttrpc: received message on inactive stream" stream=21
Jun 27 16:12:17 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:17.709655909Z" level=warning msg="cleanup warnings time=\"2024-06-27T16:12:16Z\" level=warning msg=\"failed to remove runc container\" error=\"runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Jun 27 16:12:18 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:17.949546759Z" level=error msg="failed to delete shim" error="1 error occurred:\n\t* close wait error: context deadline exceeded\n\n" id=3446a992d3cd8e0b1dbcdd97ed12c24f0294bf7e749ff14aa78336b3913bbb35
Jun 27 16:12:18 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:18.617145364Z" level=error msg="ExecSync for \"512642eaa65258cc5611bff30864f0ade97b354272e9efe24ce6344f1719fd7b\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded"
Jun 27 16:12:18 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:18.693342418Z" level=error msg="ExecSync for \"512642eaa65258cc5611bff30864f0ade97b354272e9efe24ce6344f1719fd7b\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded"
Jun 27 16:12:19 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:19.975645101Z" level=error msg="ExecSync for \"5d3a740cc4b516bf63d85526816a1c9a172e9246c253551dd58057a103645056\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded"
Jun 27 16:12:20 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: E0627 16:12:20.842955    1105 httpstream.go:290] error forwarding port 2379 to pod 7d53e2d21c400f3a61c693c996dd99acb4e678a4cef2227604b18d37c17bb217, uid : failed to execute portforward in network namespace "host": EOF
Jun 27 16:12:21 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:21.238396218Z" level=warning msg="cleanup warnings time=\"2024-06-27T16:12:21Z\" level=warning msg=\"failed to remove runc container\" error=\"runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Jun 27 16:12:21 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:21.633673979Z" level=error msg="ExecSync for \"09b745a68767142664a40ff078edded0fb70be79e2d88f645000adf1731a5f34\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 10s exceeded: context deadline exceeded"
Jun 27 16:12:30 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:30.437003128Z" level=error msg="get state for 881f356d5ab7b2b5fbd2b7fb95a67727f004c85693948f15fba945bb1331a87e" error="context deadline exceeded: unknown"
Jun 27 16:12:32 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:32.344834714Z" level=error msg="ExecSync for \"09b745a68767142664a40ff078edded0fb70be79e2d88f645000adf1731a5f34\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 10s exceeded: context deadline exceeded"
Jun 27 16:12:32 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:32.593471036Z" level=error msg="ttrpc: received message on inactive stream" stream=3
Jun 27 16:12:35 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: E0627 16:12:35.961563    1105 httpstream.go:290] error forwarding port 2379 to pod 7d53e2d21c400f3a61c693c996dd99acb4e678a4cef2227604b18d37c17bb217, uid : failed to execute portforward in network namespace "host": EOF
Jun 27 16:12:39 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: E0627 16:12:39.865919    1105 httpstream.go:290] error forwarding port 2379 to pod 7d53e2d21c400f3a61c693c996dd99acb4e678a4cef2227604b18d37c17bb217, uid : failed to execute portforward in network namespace "host": EOF
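
To gauge how widespread these failures are, the log lines can be tallied by error type. The following is a minimal sketch; the category names and regular expressions are hypothetical and are based only on the messages in the excerpt above.

```python
import re
from collections import Counter

# Hypothetical categories, derived from the messages seen in the excerpt.
PATTERNS = {
    "exec_timeout": re.compile(r"ExecSync for .* failed.*DeadlineExceeded"),
    "portforward_eof": re.compile(r"failed to execute portforward"),
    "runc_cleanup": re.compile(r"failed to remove runc container"),
}


def tally(lines):
    """Count how many log lines match each error category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, assuming containerd runs as a systemd unit named `containerd`, the output of `journalctl -u containerd` could be piped to a script that calls `tally(sys.stdin)` and prints the resulting counts.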

What is the expected correct behavior?

I would have expected to see a warning about node health, rather than containers being restarted without any obvious explanation.

Edited Jun 28, 2024 by Francois Eleouet