Evenly spread the load across all cluster nodes
Several things result in an uneven spread of load across cluster nodes:
- we start deploying units as soon as the first node is up, before the other nodes are ready, so the earliest nodes end up with more pods
- during a Cluster API rolling upgrade, we drain a node and its pods are recreated on the other nodes to compensate; when the replacement node comes up, nothing repopulates it (except DaemonSets, or pods that could not be scheduled elsewhere because of anti-affinity or limited resources)
- (we don't define anti-affinity for most workloads)
The result looks like this (example with memory): the most loaded node is the control-plane node that was installed first, and it has twice as much used RAM as the least loaded node.
We need to introduce something to smooth this out:
1. add anti-affinity to all workloads (this is a lot of work; see the sketch after this list)
2. postpone the deployment of units so that it only starts once all the nodes are built
   - worth doing, but this will not address the uneven load resulting from a CAPI rolling upgrade
3. introduce the Kubernetes descheduler (https://github.com/kubernetes-sigs/descheduler), which looks promising to me: official Kubernetes SIG project, actively maintained, made exactly for this (see the policy sketch below)
   - even then, option 2 will still be worth doing to avoid throttling during the early phases of the cluster deployment
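For option 1, a minimal sketch of what soft anti-affinity could look like on one of our workloads (the Deployment name, labels, and image below are hypothetical; the "preferred" variant keeps pods schedulable even when spreading is impossible):

```yaml
# Hypothetical workload: spread replicas of app=my-unit across nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-unit
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-unit
  template:
    metadata:
      labels:
        app: my-unit
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" (soft) rather than "required" (hard), so pods still
          # schedule when there are more replicas than nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-unit
                topologyKey: kubernetes.io/hostname
      containers:
        - name: my-unit
          image: registry.example.com/my-unit:latest
```

With the soft variant, scheduling never fails outright; the scheduler just prefers to place replicas of the same unit on different nodes.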
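For option 3, a sketch of the kind of descheduler policy I have in mind, based on its LowNodeUtilization strategy (v1alpha1 policy API; the threshold values are illustrative and would need tuning for our clusters):

```yaml
# Evicts pods from over-utilized nodes so the scheduler can re-place them on
# under-utilized ones (e.g. a node freshly recreated by a CAPI rolling upgrade).
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  LowNodeUtilization:
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # a node below ALL of these (in %) is considered under-utilized...
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
        # ...and pods are evicted from nodes above ANY of these targets
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
```

The descheduler can run in-cluster as a Job, CronJob, or Deployment; it only evicts pods and then relies on the regular scheduler to spread them, which covers exactly the post-upgrade case above.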
/cc @feleouet - we already talked about doing option 2