Realm: Add cuda scatter-reduce
This PR adds reduction-operator support to the cuda-dma
scatter/gather kernels. It extends cuda_redop.h
with two wrappers around memcpy_indirect_points
that accept reduction operators.
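For readers unfamiliar with the pattern, here is a minimal host-side C++ sketch of what a scatter or gather with a reduction operator does. It is not the code in this PR: SumRedop, scatter_apply and gather_apply are hypothetical names, the loop runs on the CPU rather than in a CUDA kernel, and the operator only loosely mirrors the apply/fold shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduction operator (illustration only, not the PR's type).
// apply combines a right-hand-side value into a destination value;
// fold combines two right-hand-side values.
struct SumRedop {
  using LHS = double;
  using RHS = double;
  static void apply(LHS &lhs, RHS rhs) { lhs += rhs; }
  static void fold(RHS &rhs1, RHS rhs2) { rhs1 += rhs2; }
};

// Scatter with reduction: src[i] is reduced into dst[indices[i]].
// Repeated indices accumulate instead of overwriting each other.
template <typename REDOP>
void scatter_apply(std::vector<typename REDOP::LHS> &dst,
                   const std::vector<typename REDOP::RHS> &src,
                   const std::vector<std::size_t> &indices) {
  assert(indices.size() == src.size());
  for (std::size_t i = 0; i < src.size(); ++i) {
    assert(indices[i] < dst.size());
    REDOP::apply(dst[indices[i]], src[i]);
  }
}

// Gather with reduction: src[indices[i]] is reduced into dst[i].
template <typename REDOP>
void gather_apply(std::vector<typename REDOP::LHS> &dst,
                  const std::vector<typename REDOP::RHS> &src,
                  const std::vector<std::size_t> &indices) {
  assert(indices.size() == dst.size());
  for (std::size_t i = 0; i < dst.size(); ++i) {
    assert(indices[i] < src.size());
    REDOP::apply(dst[i], src[indices[i]]);
  }
}
```

A plain scatter would assign dst[indices[i]] = src[i]; the reduction variant replaces that assignment with the operator, which is why a device version would need an atomic or otherwise exclusive update when indices repeat.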
The open questions are:
- Whether it is reasonable to "pre-generate" the scatter/gather reduce kernels this way. This adds redop_apply (NDIMS * NTYPES) + redop_fold (NDIMS * NTYPES) kernels.
- The gather/scatter reductions are defined in cuda_redop.h, which in turn requires implementations of memcpy_indirect_points and memcpy_affine_batch to be available, so these have been moved into a header, cuda_memcpy.h.
- How to avoid adding a whole bunch of macros.
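To make the pre-generation and macro questions concrete, here is one possible macro-free approach, sketched in host-side C++14. It is an assumption for discussion, not what this PR implements: KernelFn, redop_apply_sum, MAX_DIMS, make_row and kernels_for_type are all hypothetical names, and the kernel body is a plain CPU loop standing in for a device kernel.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Type-erased entry point standing in for one pre-generated kernel symbol.
using KernelFn = void (*)(void *dst, const void *src, std::size_t count);

// Hypothetical kernel body. In real kernels N would select the point
// dimensionality; here it is only carried as a template parameter.
template <int N, typename T>
void redop_apply_sum(void *dst, const void *src, std::size_t count) {
  T *d = static_cast<T *>(dst);
  const T *s = static_cast<const T *>(src);
  for (std::size_t i = 0; i < count; ++i)
    d[i] += s[i];
}

// Assumed upper bound on dimensions, for illustration only.
constexpr int MAX_DIMS = 3;

// Expands to one instantiation per dimension 1..MAX_DIMS for a given type.
template <typename T, std::size_t... Is>
std::array<KernelFn, sizeof...(Is)> make_row(std::index_sequence<Is...>) {
  return {{&redop_apply_sum<static_cast<int>(Is) + 1, T>...}};
}

// Table of kernels for type T, indexed by (dimension - 1).
template <typename T>
std::array<KernelFn, MAX_DIMS> kernels_for_type() {
  return make_row<T>(std::make_index_sequence<MAX_DIMS>{});
}
```

Calling kernels_for_type<float>() and kernels_for_type<double>() would instantiate NDIMS kernels per type, so the total is still NDIMS * NTYPES per operation; the pack expansion only replaces the macro list, it does not reduce the number of generated kernels.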
TODO:
- CI has multiple failures and there are still some bugs (will be fixed shortly).
- cuda_memcpy_affine_batch isn't tested separately with this change (will be fixed shortly).
- This is a fairly large PR, and I will consider splitting it into a number of smaller changes.
Edited by apryakhin