
Add support to do pack/unpack on GPU and do MPI on CPU

Junchao Zhang requested to merge jczhang/feature-sf-do-pack-on-gpu into master

Previously, to scatter vectors on GPUs, we had two methods (a rough sketch of both follows below):
(1) Copy the segment enclosing the needed entries from GPU to CPU, and then do MPI. (-use_gpu_aware_mpi 0 -vecscatter_packongpu 0)
(3) Pack the needed entries on GPU and call CUDA-aware MPI. (-use_gpu_aware_mpi 1)
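
For context, a minimal sketch of these two paths. The kernel and the names (`pack`, `d_x`, `d_idx`, `h_seg`, etc.) are made up for illustration; this is a simplified picture, not the actual PETSc VecScatter/SF code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

/* Gather the needed entries into a contiguous device buffer. */
__global__ void pack(int n, const int *d_idx, const double *d_x, double *d_buf)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_buf[i] = d_x[d_idx[i]];
}

/* Method 1: copy the contiguous segment [lo, hi) that encloses the needed
 * entries to the host, then pack and send on the CPU as usual. */
static void send_method1(int n, const int *h_idx, const double *d_x, int lo, int hi,
                         double *h_seg, double *h_buf, int dest, MPI_Comm comm)
{
  cudaMemcpy(h_seg, d_x + lo, (hi - lo) * sizeof(double), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) h_buf[i] = h_seg[h_idx[i] - lo]; /* CPU-side pack */
  MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, comm);
}

/* Method 3: pack on the GPU and pass the device buffer directly to a
 * CUDA-aware MPI implementation. */
static void send_method3(int n, const int *d_idx, const double *d_x,
                         double *d_buf, int dest, MPI_Comm comm)
{
  pack<<<(n + 255) / 256, 256>>>(n, d_idx, d_x, d_buf);
  cudaDeviceSynchronize();
  MPI_Send(d_buf, n, MPI_DOUBLE, dest, 0, comm); /* device pointer */
}
```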

@karlrupp doubted the performance gain of Method 3, so I added Method 2 in this MR and compared all three (results in the figure below):
(2) Pack the needed entries on GPU, then copy them to CPU, then do MPI. (-use_gpu_aware_mpi 0 -vecscatter_packongpu 1)
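
A minimal sketch of this path, reusing the hypothetical `pack` kernel, headers, and names from the sketch above (again, not the actual PETSc implementation):

```cuda
/* Method 2: pack on the GPU, copy only the packed entries to the host,
 * then do plain (non-CUDA-aware) MPI from host memory. */
static void send_method2(int n, const int *d_idx, const double *d_x,
                         double *d_buf, double *h_buf, int dest, MPI_Comm comm)
{
  pack<<<(n + 255) / 256, 256>>>(n, d_idx, d_x, d_buf); /* gather on device */
  cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, comm);        /* host pointer */
}
```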

(Benchmark results: see the attached image cudampi-2.)

We can observe:

  1. In MatMult(), we copy the whole Mvctx->lvec from CPU to GPU, so Method 2 does not save anything there; but it does help when scattering the off-diagonal vector entries, so we see savings in the GPU-to-CPU copy (a toy size comparison follows this list).
  2. With more ranks and the same vector, Method 2's advantage diminishes, since the segment-based Method 1 copies fewer redundant entries per rank.
  3. In Method 2 we copy only the entries that are needed, so the up and down memcpy sizes between CPU and GPU are the same.
  4. With more ranks, more vector entries become remote, so the overall memcpy size between CPU and GPU grows (2446.1 → 6102.0 in the figure).
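
To make the GPU-to-CPU savings concrete, here is a toy size comparison with assumed numbers (not taken from the figure): a rank that needs only a few scattered remote entries still stages the whole enclosing segment in Method 1, but only the packed entries in Method 2.

```c
#include <stdio.h>

int main(void)
{
  /* Assumed toy numbers: 3 needed entries whose indices span an
   * enclosing segment of 1000 entries. */
  int n = 3, seg = 1000;
  printf("Method 1 D2H: %zu bytes (whole enclosing segment)\n", seg * sizeof(double));
  printf("Method 2 D2H: %zu bytes (packed entries only)\n", n * sizeof(double));
  return 0;
}
```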

Method 2 is useful for performance studies or when CUDA-aware MPI is unavailable (though that is becoming less likely).

I hope to get more insight into the MPS case at the GPU hackathon.
