fluentd-elasticsearch is failing to start up on some urgent-cpu nodes
The urgent-cpu-bound node pool has automatically scaled up over time. At least 2 of its nodes are failing to schedule the pod from the DaemonSet we use to capture logs from running Pods, which means we are not ingesting logs from any service on those nodes.
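To confirm the symptom, something like the following should show the DaemonSet short of its desired pod count and why the missing pods are stuck (the `logging` namespace, the `fluentd-elasticsearch` name, and the label selector are assumptions; adjust to match our deployment):

```shell
# Compare desired vs. current/ready pod counts for the DaemonSet.
kubectl -n logging get daemonset fluentd-elasticsearch

# List any collector pods stuck in Pending, then read the scheduler's
# reason from the pod events (likely "Insufficient cpu" on these nodes).
kubectl -n logging get pods -l app=fluentd-elasticsearch \
  --field-selector=status.phase=Pending
kubectl -n logging describe pod <pending-pod-name>
```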
The problem is that this is a DaemonSet: spinning up a new node does not help, because the pod has to run on the affected node itself, and Kubernetes does not know it can shift workloads elsewhere to make room. This should be resolvable with Pod Priority and preemption, or by manually evicting an existing Pod on the impacted node.
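A sketch of the Pod Priority approach: give the DaemonSet a high-priority class so the scheduler can preempt a lower-priority Pod on the full nodes. The class name and value below are illustrative, and the namespace/DaemonSet name are assumptions:

```shell
# Create a high-priority class for node-critical logging infrastructure.
# (Name and value are illustrative; for cluster add-ons the built-in
# system-node-critical class could be used instead.)
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: logging-node-critical
value: 1000000
globalDefault: false
description: "Lets the log-collection DaemonSet preempt lower-priority Pods."
EOF

# Reference it from the DaemonSet's pod template so the scheduler
# preempts a lower-priority Pod when a node is full.
kubectl -n logging patch daemonset fluentd-elasticsearch --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/priorityClassName","value":"logging-node-critical"}]'
```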
While we are here, we should investigate how long this has been going on, since it should have triggered at least an alert.
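Scheduling failures surface as `FailedScheduling` events, though event retention is short (roughly one hour by default), so the full timeline will likely need to come from monitoring history instead. As a starting point (namespace assumed):

```shell
# Recent scheduling failures, newest last. Event TTL is short, so for
# the full duration check our metrics/alert history instead (e.g. the
# kube-state-metrics series kube_daemonset_status_number_unavailable).
kubectl -n logging get events \
  --field-selector reason=FailedScheduling \
  --sort-by=.lastTimestamp
```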