Distribution of grid points and contiguity
For a system needing 180x180x180 (big) grid points and running with 16 MPI processes, the (uniform) distribution is:
```
InitMesh: MESH = 360 x 360 x 360 = 46656000
InitMesh: (bp) = 180 x 180 x 180 = 5832000
InitMesh: Mesh cutoff (required, used) = 150.000 176.881 Ry
New grid distribution [1]: sub = 2
    1   1: 180    1:  45    1:  45
    2   1: 180    1:  45   46:  90
    3   1: 180    1:  45   91: 135
    4   1: 180    1:  45  136: 180
    5   1: 180   46:  90    1:  45
    6   1: 180   46:  90   46:  90
    7   1: 180   46:  90   91: 135
    8   1: 180   46:  90  136: 180
    9   1: 180   91: 135    1:  45
   10   1: 180   91: 135   46:  90
   11   1: 180   91: 135   91: 135
   12   1: 180   91: 135  136: 180
   13   1: 180  136: 180    1:  45
   14   1: 180  136: 180   46:  90
   15   1: 180  136: 180   91: 135
   16   1: 180  136: 180  136: 180
```
Every process holds the full extent of the X axis, so the distribution is really a 2D grid of processes over Y and Z, in this case 4x4. In that process grid the rank index varies faster along Z than along Y: the first 4 processes together form a slab perpendicular to Y, processes 5-8 form the next slab, and so on.
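To make the ordering concrete, here is a minimal Python sketch (not SIESTA code; the names `npt`, `npy`, `npz`, `step` are just illustrative, and the uniform 45-point blocks are assumed) that reproduces the rank-to-(Y,Z)-block mapping printed in the log above, with the Z index varying fastest:

```python
# Illustrative sketch (not SIESTA source): reproduce the rank -> (Y,Z) block
# mapping shown in the log, with the process index varying fastest along Z.
npt = 180          # big-grid points along Y and Z
npy, npz = 4, 4    # process grid over Y and Z (16 MPI ranks)
step = npt // npy  # 45 points per block in this uniform case

for rank in range(npy * npz):
    iy, iz = divmod(rank, npz)   # Z block index varies fastest (current scheme)
    y0, y1 = iy * step + 1, (iy + 1) * step
    z0, z1 = iz * step + 1, (iz + 1) * step
    print(f"{rank + 1:5d}   1: {npt}   {y0:3d}: {y1:3d}   {z0:3d}: {z1:3d}")
```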
Now imagine that you need to group the data by process groups (say, the processes in a node). Assume for example that you need to create 8-process groups. With the current ordering, the first group (P1-P8) would jointly hold (1:180, 1:90, 1:180), which is non-contiguous in memory: for each Z plane only half of the Y range belongs to the group.
If the process grid instead varied faster in Y, the first group would hold (1:180, 1:180, 1:90), which is contiguous: full X-Y planes for the first 90 Z values, i.e. one single chunk of the column-major global array.
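The contiguity claim can be checked directly, e.g. with numpy, under the assumption that the global real-space array is stored in Fortran (column-major) order with X as the first index:

```python
# Sketch of the contiguity argument, assuming a Fortran-ordered (x, y, z) array.
import numpy as np

n = 180
grid = np.zeros((n, n, n), order='F')   # global (x, y, z) array, column-major

# Current scheme (Z varies fastest): P1-P8 together hold (1:180, 1:90, 1:180)
half_y = grid[:, :n // 2, :]
print(half_y.flags['F_CONTIGUOUS'])     # False: the block is scattered in memory

# Y-fastest alternative: P1-P8 would hold (1:180, 1:180, 1:90)
half_z = grid[:, :, :n // 2]
print(half_z.flags['F_CONTIGUOUS'])     # True: one contiguous chunk of memory
```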
Some of the ideas for I/O optimization proposed by @rgrima involve a node-based grouping, and for those it would be more convenient to use the "contiguous" alternative for the YZ process grid. Is there any other reason to prefer the current scheme?