Optimize CUDA PME Decomposition
These issues apply to both hybrid mode and full-GPU decomposition:
- Reported by @pszilard at !1711 (comment 638922484). Two packing/reduction kernels are launched, one for the left neighbor and one for the right. These tasks are independent, so they can be moved to different streams or combined into a single kernel.
- Currently, we create a full PME grid on the GPU on each rank to avoid doing coordinate transformations in the spread/gather kernels. A side effect is that the grid data that needs to be communicated between X-domain neighbors is not contiguous in the case of pencil decomposition. For the time being, I send the full data along the y-dimension, but this can be optimized either by packing/unpacking the non-contiguous data, or by allocating smaller grids on each local rank and doing the necessary coordinate transform in the spread/gather kernels. I am inclined towards the latter.
- Experiment with different block sizes and decide on the optimal configuration for the kernels used in halo exchange and PME grid <-> FFT grid conversion. See !2117 (comment 718564376).
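
To make the packing option from the second item concrete, here is a minimal host-side sketch of the indexing a packing kernel could use. It assumes a row-major grid with z fastest and x slowest, which is an assumption for illustration and should be checked against the actual PME grid layout. `packSubBox` and all its parameters are hypothetical names, not existing GROMACS functions. A GPU version would presumably map the (x, y, z) loops onto threads rather than iterate on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GROMACS.
// Copies the sub-box [xBegin, xEnd) x [yBegin, yEnd) x [0, nz) out of a grid
// stored as index = (x * ny + y) * nz + z (assumed layout, z fastest) into a
// contiguous buffer suitable for sending to an X-domain neighbor.
std::vector<float> packSubBox(const std::vector<float>& grid,
                              std::size_t ny, std::size_t nz,
                              std::size_t xBegin, std::size_t xEnd,
                              std::size_t yBegin, std::size_t yEnd)
{
    // Basic range checks on the requested sub-box.
    assert(xBegin <= xEnd && yBegin <= yEnd && yEnd <= ny);
    assert(ny * nz == 0 || xEnd <= grid.size() / (ny * nz));

    std::vector<float> buffer;
    buffer.reserve((xEnd - xBegin) * (yEnd - yBegin) * nz);

    for (std::size_t x = xBegin; x < xEnd; ++x)
    {
        for (std::size_t y = yBegin; y < yEnd; ++y)
        {
            // Each (x, y) pair owns one contiguous run of nz values.
            const float* row = grid.data() + (x * ny + y) * nz;
            buffer.insert(buffer.end(), row, row + nz);
        }
    }
    return buffer;
}
```

With this layout, only the z-runs are contiguous once the y-range is restricted, which is why sending the full y-extent avoids packing at the cost of extra communicated data. Unpacking on the receiving side would apply the same index mapping in reverse.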