WIP: Reorder the communication strategies to improve performance
Right now, all ranks are distributed in a 4D hypercube whose dimensions are domain, states, k points, and other (in this order). The call to MPI_Cart_create returns a Cartesian communicator in which the ranks are usually assigned in row-major order. This means that the last index (other) runs fastest and the first index (domain) runs slowest.
Usually, tasks are assigned to nodes in a block fashion, i.e. the first ranks (0 to x-1) go to the first node, the ranks x to 2x-1 go to the second node, etc. As a result, the last parallelization strategy is placed as compactly as possible, whereas the first parallelization strategy is always spread out over several nodes. Currently, the domain parallelization is therefore not compact: the ranks sharing the domain for the same state and k point are usually spread out over several nodes.
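To make the placement concrete, here is a minimal sketch in plain Python (no MPI involved) of the mapping described above. It assumes strict row-major rank ordering and block assignment of ranks to nodes; the grid size (4, 2, 2, 1), the 4 cores per node, and the function names are illustrative assumptions, not values or code from Octopus.

```python
def row_major_rank(coords, dims):
    """Rank of a process at `coords` in a row-major Cartesian grid `dims`."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

def nodes_of_group(dims, vary_axis, fixed, cores_per_node):
    """Nodes hosting the ranks that differ only along `vary_axis`.

    Assumes block placement: ranks 0..x-1 on node 0, x..2x-1 on node 1, etc.
    """
    nodes = set()
    for i in range(dims[vary_axis]):
        coords = list(fixed)
        coords[vary_axis] = i
        nodes.add(row_major_rank(coords, dims) // cores_per_node)
    return sorted(nodes)

# Current ordering: (domain, states, k points, other); sizes are made up.
dims = (4, 2, 2, 1)
cores_per_node = 4

# Ranks sharing the domain for state 0, k point 0.
domain_ranks = [row_major_rank((i, 0, 0, 0), dims) for i in range(dims[0])]
domain_nodes = nodes_of_group(dims, 0, (0, 0, 0, 0), cores_per_node)
print(domain_ranks, domain_nodes)
```

In this toy setup the four ranks of one domain group are 0, 4, 8, and 12, so each of them sits on a different node and every domain communication crosses the interconnect.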
The problem here is that communication inside a node is faster than across nodes because it can be handled in shared memory instead of going through the interconnect. Thus, the communication that happens most often should take place between ranks that are placed as compactly as possible.
In Octopus, the communication that happens most often is the one within the domain parallelization strategy. Thus, it should be beneficial for most applications to have a compact arrangement of the tasks in the domain parallelization. This can be achieved by reordering the dimensions of the parallelization hypercube: by making the domain parallelization the last dimension, its ranks are placed as compactly as possible on the nodes.
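The effect of the reordering can be checked with a small, self-contained Python sketch that counts how many nodes each domain group spans for both orderings. It assumes row-major rank ordering and block placement of ranks on nodes; the grid sizes, the 4 cores per node, and the function name are hypothetical and chosen only for illustration.

```python
from itertools import product

def nodes_per_domain_group(dims, domain_axis, cores_per_node):
    """Number of nodes spanned by each domain group in a row-major grid."""
    # Row-major strides: the last axis has stride 1.
    strides = [1] * len(dims)
    for a in range(len(dims) - 2, -1, -1):
        strides[a] = strides[a + 1] * dims[a + 1]

    other_axes = [a for a in range(len(dims)) if a != domain_axis]
    counts = []
    # One domain group per combination of the remaining coordinates.
    for fixed in product(*(range(dims[a]) for a in other_axes)):
        base = sum(c * strides[a] for a, c in zip(other_axes, fixed))
        nodes = {
            (base + i * strides[domain_axis]) // cores_per_node
            for i in range(dims[domain_axis])
        }
        counts.append(len(nodes))
    return counts

cores_per_node = 4
# Old ordering: (domain, states, k points, other), domain is axis 0.
old = nodes_per_domain_group((4, 2, 2, 1), 0, cores_per_node)
# Reversed ordering: (other, k points, states, domain), domain is axis 3.
new = nodes_per_domain_group((1, 2, 2, 4), 3, cores_per_node)
print("old:", old, "new:", new)
```

With these made-up sizes, every domain group spans four nodes in the old ordering and exactly one node in the reversed ordering.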
In this commit, the dimensions of the parallelization strategies are reversed to have the most used parallelization strategies placed as compact as possible.
Tests have shown up to 20% speed-up depending on the input data. For best performance, set ParDomains to the number of cores per node (or a fraction/multiple of this number, e.g. 1/2 or 1/4 or 2).
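The recommendation for ParDomains can be motivated with a short Python sketch. It assumes that the domain dimension is the last (fastest-running) one, so that each domain group occupies a contiguous range of ranks, and that ranks are placed on nodes in blocks. The function name, the 4 cores per node, and the tested values are illustrative assumptions only.

```python
def max_nodes_per_domain_group(par_domains, cores_per_node, n_groups):
    """Largest number of nodes spanned by any contiguous domain group."""
    worst = 0
    for g in range(n_groups):
        first = g * par_domains            # first rank of the group
        last = first + par_domains - 1     # last rank of the group
        span = last // cores_per_node - first // cores_per_node + 1
        worst = max(worst, span)
    return worst

cores_per_node = 4
spans = {
    p: max_nodes_per_domain_group(p, cores_per_node, n_groups=4)
    for p in (2, 3, 4, 8)
}
print(spans)
```

In this example, ParDomains values of 2 and 4 keep every group inside one node and 8 uses the minimum of two nodes, whereas 3 does not divide the node size and makes some groups straddle a node boundary.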
Reorder communication strategies to improve performance
- I have checked that my code follows the Octopus coding standards