Realm: Add cuda scatter-reduce

apryakhin requested to merge cuda-dma-scatter-reduce into cuda-dma

This PR adds reduction-operator support to the scatter/gather kernels in cuda-dma. It extends cuda_redop.h with two wrappers around memcpy_indirect_points that accept a reduction operator.
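
For readers unfamiliar with the pattern, here is a minimal host-side sketch of what a scatter with a reduction operator does. It is not the code in this PR: `SumRedop`, `scatter_reduce_apply`, and `scatter_reduce_demo` are hypothetical names, and the loop runs on the CPU rather than as a CUDA kernel around memcpy_indirect_points.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduction operator (illustrative only, not taken from cuda_redop.h).
struct SumRedop {
  typedef int LHS;
  typedef int RHS;
  // apply: combine a contribution into a destination value
  static void apply(LHS &lhs, RHS rhs) { lhs += rhs; }
  // fold: combine two partial contributions
  static void fold(RHS &rhs1, RHS rhs2) { rhs1 += rhs2; }
};

// Host-side stand-in for a scatter kernel. Instead of a plain copy
// dst[indices[i]] = src[i], each element is combined via REDOP::apply,
// so repeated indices accumulate rather than overwrite.
template <typename REDOP>
void scatter_reduce_apply(std::vector<typename REDOP::LHS> &dst,
                          const std::vector<typename REDOP::RHS> &src,
                          const std::vector<std::size_t> &indices)
{
  assert(src.size() == indices.size());
  for (std::size_t i = 0; i < src.size(); i++) {
    assert(indices[i] < dst.size());
    REDOP::apply(dst[indices[i]], src[i]);
  }
}

// Small demo: index 0 is hit twice, so it receives both contributions.
inline std::vector<int> scatter_reduce_demo()
{
  std::vector<int> dst = {10, 20, 30};
  std::vector<int> src = {1, 2, 3, 4};
  std::vector<std::size_t> idx = {0, 2, 0, 1};
  scatter_reduce_apply<SumRedop>(dst, src, idx);
  return dst;
}
```

On a GPU, colliding indices would additionally need atomic or otherwise exclusive updates, which this sequential sketch does not model.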

The open questions are:

  • Whether it is reasonable to "pre-generate" the scatter/gather reduce kernels this way. Doing so adds NDIMS * NTYPES instantiations for redop_apply plus another NDIMS * NTYPES for redop_fold.
  • The gather/scatter reductions are defined in cuda_redop.h, which requires the implementations of memcpy_indirect_points and memcpy_affine_batch to be visible. They have therefore been moved into a header, cuda_memcpy.h.
  • How to avoid adding a whole bunch of macros.
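
One possible direction for the macro question, sketched under assumptions: let templates enumerate the (NDIMS, type, apply/fold) combinations and select one at runtime from a small table. Everything here (`RedopKind`, `kernel_stub`, `lookup_kernel`, the limit of 3 dimensions) is hypothetical and only illustrates the shape of the idea. The stub returns a tag instead of launching a kernel.

```cpp
#include <cassert>

// Hypothetical tag for the two reduction flavours mentioned above.
enum RedopKind { REDOP_APPLY = 0, REDOP_FOLD = 1 };

typedef int (*KernelFn)();

// Stand-in for one pre-generated kernel. It returns a tag that encodes
// its template parameters so the dispatch can be checked.
template <int NDIMS, typename T, RedopKind KIND>
int kernel_stub()
{
  return NDIMS * 10 + static_cast<int>(KIND);
}

// For a given element type and reduction flavour, the compiler instantiates
// one stub per dimension. The runtime dimension picks the entry.
template <typename T, RedopKind KIND>
KernelFn lookup_kernel(int ndims)
{
  static const KernelFn table[3] = {
    &kernel_stub<1, T, KIND>,
    &kernel_stub<2, T, KIND>,
    &kernel_stub<3, T, KIND>,
  };
  assert(ndims >= 1 && ndims <= 3);
  return table[ndims - 1];
}
```

For example, `lookup_kernel<float, REDOP_FOLD>(2)` returns the 2-D fold stub. The number of instantiations is unchanged (still NDIMS * NTYPES per flavour); only the mechanism for generating them differs.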

TODO:

  • CI has multiple failures and there are still some bugs (will be fixed shortly).
  • cuda_memcpy_affine_batch is not tested separately in this change (will be fixed shortly).
  • This is a fairly large PR; I will consider splitting it into a number of smaller changes.
Edited by apryakhin