Improve load distribution across management cluster Nodes
Today we don't have much in place to ensure that the Nodes of the management cluster are evenly loaded.
We have observed that in some cases there can be significant differences in the load of the different Nodes, both for CPU and, in particular, for memory (e.g. one case where one Node was at 40% memory usage while another was at 80%).
Different things contribute to that:
* the k8s scheduler places Pods based on `resources.requests` (CPU/memory), but these values aren't always specified, and may or may not reflect actual usage
* node rolling updates with `maxSurge: 0` tend to introduce unevenness
* with `maxSurge: 0` the last operation done during a node rolling update is the creation of a fresh node, which is not followed by any drain of a node
* this is what we have by default on ~capm3 deployment
* (with `maxSurge: 1`, by contrast, the drain of the last old node will populate the last created node -- this is the more typical behavior for VM-based environments)
* some heavy workloads are single-replica (e.g. Prometheus), so their whole footprint lands on one Node
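To illustrate the first point: the scheduler bin-packs Pods based only on their declared requests, not on observed consumption. A sketch (all values hypothetical) of how a Pod with an understated request can overload a Node:

```yaml
# Hypothetical Pod spec: the scheduler only looks at resources.requests
# when choosing a Node, never at real usage.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      resources:
        requests:
          cpu: 100m      # scheduling decision is based on these numbers;
          memory: 512Mi  # if actual usage is e.g. 3Gi, the Node ends up
                         # far more loaded than the scheduler assumed
        # if requests are omitted entirely, the scheduler treats the Pod
        # as (nearly) free and can stack many such Pods on one Node
```

This is why (1) below matters: placement can only be even if requests track reality.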
We need to put things in place to improve this, some of which are already in progress:
* (1) ensure that memory and CPU `resources.requests` are (a) close to actual average use and (b) not too far from peak use
* relates to the effort of introducing VPA (sylva-core!6381)
* is it sufficient? does it cover the first deployment of a Pod?
* (2) ensure population of fresh nodes during/after a node rolling update by enabling and tuning the `descheduler` unit
* it seems that this isn't easy to do without first having done (1)
* this is important essentially for ~capm3 (because they use `maxSurge: 0`)
* (3) see when we can run heavy workloads as 2-3 smaller Pods instead of a single big one (for Prometheus this relates to the topic of revisiting our monitoring stack)
* (not an exhaustive list!)
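For (2), a starting point could be the descheduler's `LowNodeUtilization` strategy, which evicts Pods from over-utilized Nodes so they get rescheduled onto under-utilized ones (e.g. a freshly created Node after a rolling update). A sketch using the descheduler `v1alpha1` policy format; the threshold percentages are placeholders to be tuned:

```yaml
# Hypothetical descheduler policy sketch (thresholds are placeholders).
# Nodes below `thresholds` are considered under-utilized; Nodes above
# `targetThresholds` are considered over-utilized and have Pods evicted.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:        # "under-utilized" below these
          cpu: 50
          memory: 50
        targetThresholds:  # "over-utilized" above these
          cpu: 70
          memory: 70
```

Note that `LowNodeUtilization` computes utilization from Pod requests, not from actual usage, which is why this depends on (1) being done first.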
epic