Unclear errors on kubeadm undersized nodes
Summary
A misconfigured cluster on undersized VMs was unhealthy, with many pods in a crash loop, but the nodes reported no MemoryPressure.
Steps to reproduce
Deploy kubeadm-capo with the default flavor (m1.large: 4 vCPUs, 8 GB RAM)
What is the current bug behavior?
kswapd was consuming a large share of CPU, and only a small amount of RAM was available (~500M)
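For anyone trying to confirm the same symptom on a node, here is a minimal sketch that reads MemAvailable from /proc/meminfo and flags the node when it drops below a chosen floor. The function names and the 1024 MiB floor are my own assumptions, not anything kubeadm or kubelet provides. As far as I know, kubelet's default hard eviction threshold is memory.available<100Mi, and kubelet derives that figure from its own cgroup statistics rather than from MemAvailable, so this is only an approximation of why ~500M free would not trigger MemoryPressure.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field name: value}.

    Most values are in KiB; a few counters (e.g. HugePages_Total) have no unit.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_mib(text: str) -> float:
    """Return MemAvailable converted from KiB to MiB."""
    return parse_meminfo(text)["MemAvailable"] / 1024


def looks_starved(text: str, warn_below_mib: float = 1024.0) -> bool:
    """True when available memory is under the floor.

    The 1024 MiB default is an arbitrary, hypothetical value for illustration.
    """
    return available_mib(text) < warn_below_mib


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the raw meminfo text from a Linux node."""
    with open(path, encoding="ascii") as handle:
        return handle.read()
```

On the affected node, `looks_starved(read_meminfo())` would be the call to run; a high kswapd CPU share alongside a `True` result would match what is described here.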
containerd was logging many timeout errors when executing health checks in containers:
Jun 27 16:12:17 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:17.178611532Z" level=error msg="ttrpc: received message on inactive stream" stream=21
Jun 27 16:12:17 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:17.709655909Z" level=warning msg="cleanup warnings time=\"2024-06-27T16:12:16Z\" level=warning msg=\"failed to remove runc container\" error=\"runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Jun 27 16:12:18 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:17.949546759Z" level=error msg="failed to delete shim" error="1 error occurred:\n\t* close wait error: context deadline exceeded\n\n" id=3446a992d3cd8e0b1dbcdd97ed12c24f0294bf7e749ff14aa78336b3913bbb35
Jun 27 16:12:18 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:18.617145364Z" level=error msg="ExecSync for \"512642eaa65258cc5611bff30864f0ade97b354272e9efe24ce6344f1719fd7b\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded"
Jun 27 16:12:18 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:18.693342418Z" level=error msg="ExecSync for \"512642eaa65258cc5611bff30864f0ade97b354272e9efe24ce6344f1719fd7b\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded"
Jun 27 16:12:19 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:19.975645101Z" level=error msg="ExecSync for \"5d3a740cc4b516bf63d85526816a1c9a172e9246c253551dd58057a103645056\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded"
Jun 27 16:12:20 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: E0627 16:12:20.842955 1105 httpstream.go:290] error forwarding port 2379 to pod 7d53e2d21c400f3a61c693c996dd99acb4e678a4cef2227604b18d37c17bb217, uid : failed to execute portforward in network namespace "host": EOF
Jun 27 16:12:21 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:21.238396218Z" level=warning msg="cleanup warnings time=\"2024-06-27T16:12:21Z\" level=warning msg=\"failed to remove runc container\" error=\"runc did not terminate successfully: exit status 255: \" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Jun 27 16:12:21 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:21.633673979Z" level=error msg="ExecSync for \"09b745a68767142664a40ff078edded0fb70be79e2d88f645000adf1731a5f34\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 10s exceeded: context deadline exceeded"
Jun 27 16:12:30 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:30.437003128Z" level=error msg="get state for 881f356d5ab7b2b5fbd2b7fb95a67727f004c85693948f15fba945bb1331a87e" error="context deadline exceeded: unknown"
Jun 27 16:12:32 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:32.344834714Z" level=error msg="ExecSync for \"09b745a68767142664a40ff078edded0fb70be79e2d88f645000adf1731a5f34\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 10s exceeded: context deadline exceeded"
Jun 27 16:12:32 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: time="2024-06-27T16:12:32.593471036Z" level=error msg="ttrpc: received message on inactive stream" stream=3
Jun 27 16:12:35 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: E0627 16:12:35.961563 1105 httpstream.go:290] error forwarding port 2379 to pod 7d53e2d21c400f3a61c693c996dd99acb4e678a4cef2227604b18d37c17bb217, uid : failed to execute portforward in network namespace "host": EOF
Jun 27 16:12:39 management-cluster-cp-0ba2c928af-rq98n containerd[1105]: E0627 16:12:39.865919 1105 httpstream.go:290] error forwarding port 2379 to pod 7d53e2d21c400f3a61c693c996dd99acb4e678a4cef2227604b18d37c17bb217, uid : failed to execute portforward in network namespace "host": EOF
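To get a quick picture of which containers are hitting probe timeouts, the ExecSync lines above can be tallied with a short script. This is only a sketch based on the log format shown in this report; the function name `count_exec_timeouts` and the choice to shorten container IDs to 12 characters are my own, and the regular expression may need adjusting for other containerd versions.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   msg="ExecSync for \"<container id>\" failed" ... timeout 5s exceeded
# The container ID is wrapped in backslash-escaped quotes in the journal output.
EXEC_SYNC_TIMEOUT = re.compile(
    r'ExecSync for \\"([0-9a-f]+)\\" failed.*?timeout (\d+)s exceeded'
)


def count_exec_timeouts(lines: Iterable[str]) -> Counter:
    """Count ExecSync timeouts per (short container ID, timeout in seconds)."""
    counts: Counter = Counter()
    for line in lines:
        match = EXEC_SYNC_TIMEOUT.search(line)
        if match:
            short_id = match.group(1)[:12]
            counts[(short_id, int(match.group(2)))] += 1
    return counts


# Example use against saved journal output (file name is hypothetical):
#   with open("containerd.log") as handle:
#       for key, n in count_exec_timeouts(handle).most_common():
#           print(key, n)
```

A container that shows up repeatedly here with a short timeout is a likely candidate for the restarts, since failed exec probes eventually cause kubelet to restart it.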
What is the expected correct behavior?
I would have expected some warning about node health, rather than containers being restarted without any obvious explanation
Edited by Francois Eleouet