Distribution of grid points and contiguity
For a system needing 180x180x180 (big) grid points and running with 16 MPI processes, the (uniform) distribution is:
```
InitMesh: MESH = 360 x 360 x 360 = 46656000
InitMesh: (bp) = 180 x 180 x 180 = 5832000
InitMesh: Mesh cutoff (required, used) = 150.000 176.881 Ry
New grid distribution [1]: sub = 2
    1   1: 180    1:  45    1:  45
    2   1: 180    1:  45   46:  90
    3   1: 180    1:  45   91: 135
    4   1: 180    1:  45  136: 180
    5   1: 180   46:  90    1:  45
    6   1: 180   46:  90   46:  90
    7   1: 180   46:  90   91: 135
    8   1: 180   46:  90  136: 180
    9   1: 180   91: 135    1:  45
   10   1: 180   91: 135   46:  90
   11   1: 180   91: 135   91: 135
   12   1: 180   91: 135  136: 180
   13   1: 180  136: 180    1:  45
   14   1: 180  136: 180   46:  90
   15   1: 180  136: 180   91: 135
   16   1: 180  136: 180  136: 180
```
Every process holds the full extent of the X axis, so the distribution is really a 2D grid of processes over Y and Z, in this case 4x4. In that process grid the rank index varies faster along Z than along Y: the first 4 processes together form a slab perpendicular to Y, processes 5-8 form the next slab, and so on.
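To make the ordering concrete, here is a minimal Python sketch (not SIESTA code; the names `npt`, `npy`, `npz`, `step` are just illustrative, and the uniform 45-point blocks are assumed) that reproduces the rank-to-(Y,Z)-block mapping printed in the log above, with the Z index varying fastest:

```python
# Illustrative sketch (not SIESTA source): reproduce the rank -> (Y,Z) block
# mapping shown in the log, with the process index varying fastest along Z.
npt = 180          # big-grid points along Y and Z
npy, npz = 4, 4    # process grid over Y and Z (16 MPI ranks)
step = npt // npy  # 45 points per block in this uniform case

for rank in range(npy * npz):
    iy, iz = divmod(rank, npz)   # Z block index varies fastest (current scheme)
    y0, y1 = iy * step + 1, (iy + 1) * step
    z0, z1 = iz * step + 1, (iz + 1) * step
    print(f"{rank + 1:5d}   1: {npt}   {y0:3d}: {y1:3d}   {z0:3d}: {z1:3d}")
```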
Now imagine that you need to group the data by process groups (say, the processes in a node). Assume for example that you need to create 8-process groups. With the current ordering, the first group (P1-P8) would jointly hold (1:180, 1:90, 1:180), which is non-contiguous in memory: for each Z plane only half of the Y range belongs to the group.
If the process grid instead varied faster in Y, the first group would hold (1:180, 1:180, 1:90), which is contiguous: full X-Y planes for the first 90 Z values, i.e. one single chunk of the column-major global array.
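The contiguity claim can be checked directly, e.g. with numpy, under the assumption that the global real-space array is stored in Fortran (column-major) order with X as the first index:

```python
# Sketch of the contiguity argument, assuming a Fortran-ordered (x, y, z) array.
import numpy as np

n = 180
grid = np.zeros((n, n, n), order='F')   # global (x, y, z) array, column-major

# Current scheme (Z varies fastest): P1-P8 together hold (1:180, 1:90, 1:180)
half_y = grid[:, :n // 2, :]
print(half_y.flags['F_CONTIGUOUS'])     # False: the block is scattered in memory

# Y-fastest alternative: P1-P8 would hold (1:180, 1:180, 1:90)
half_z = grid[:, :, :n // 2]
print(half_z.flags['F_CONTIGUOUS'])     # True: one contiguous chunk of memory
```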
Some of the ideas for I/O optimization proposed by @rgrima involve a node-based grouping, and for those it would be more convenient to use the "contiguous" alternative for the YZ process grid. Is there any other reason to prefer the current scheme?