Realm: Add cuda scatter-reduce (!863) · Merge requests · StanfordLegion / legion

apryakhin requested to merge cuda-dma-scatter-reduce into cuda-dma Aug 01, 2023

This PR adds reduction-operator support to the scatter/gather kernels in cuda-dma. It extends cuda_redop.h with two wrappers around memcpy_indirect_points that accept reduction operators.

The open questions are:

Whether it is reasonable to "pre-generate" scatter/gather reduce kernels this way. This adds redop_apply (NDIMS * NTYPES) + redop_fold (NDIMS * NTYPES) kernels.
The gather/scatter reductions are defined in cuda_redop.h, which requires the implementations of memcpy_indirect_points and memcpy_affine_batch to be available, so those have been moved into a header, cuda_memcpy.h.
How to avoid adding a whole bunch of macros.
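
To make the first and third questions more concrete, here is a minimal host-side sketch of how a pair of apply/fold wrappers could be generated from templates instead of macros. It is not the code in this MR and does not use Realm's API: SumRedop, Point, linearize, scatter_apply, scatter_fold and num_pregenerated are hypothetical names, and a plain loop stands in for the CUDA kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduction operator. The apply/fold split only mirrors the
// redop_apply / redop_fold naming used in the description.
struct SumRedop {
  typedef double LHS;
  typedef double RHS;
  static void apply(LHS &lhs, RHS rhs) { lhs += rhs; }    // update a stored value
  static void fold(RHS &rhs1, RHS rhs2) { rhs1 += rhs2; } // combine two partials
};

// Hypothetical N-dimensional point used as a scatter target.
template <int N>
using Point = std::array<std::size_t, N>;

// Row-major linearization of a point within a dense extent.
template <int N>
std::size_t linearize(const Point<N> &p, const Point<N> &extent) {
  std::size_t idx = 0;
  for (int d = 0; d < N; d++) {
    assert(p[d] < extent[d]);
    idx = idx * extent[d] + p[d];
  }
  return idx;
}

// Host stand-in for a scatter-reduce kernel using REDOP::apply.
// A real GPU kernel would need atomics when target points repeat.
template <int N, typename REDOP>
void scatter_apply(std::vector<typename REDOP::LHS> &dst,
                   const Point<N> &extent,
                   const std::vector<Point<N>> &points,
                   const std::vector<typename REDOP::RHS> &src) {
  assert(points.size() == src.size());
  for (std::size_t i = 0; i < points.size(); i++)
    REDOP::apply(dst[linearize<N>(points[i], extent)], src[i]);
}

// Same traversal, but the destination holds partial results (REDOP::fold).
template <int N, typename REDOP>
void scatter_fold(std::vector<typename REDOP::RHS> &dst,
                  const Point<N> &extent,
                  const std::vector<Point<N>> &points,
                  const std::vector<typename REDOP::RHS> &src) {
  assert(points.size() == src.size());
  for (std::size_t i = 0; i < points.size(); i++)
    REDOP::fold(dst[linearize<N>(points[i], extent)], src[i]);
}

// Number of pre-generated kernels: one apply and one fold variant
// per (dimension, type) pair, i.e. 2 * NDIMS * NTYPES.
constexpr int num_pregenerated(int ndims, int ntypes) {
  return 2 * ndims * ntypes;
}
```

A call such as scatter_apply<2, SumRedop>(dst, extent, points, src) instantiates one variant on demand. Pre-generating would mean explicitly instantiating every (N, type) combination up front, which is where the 2 * NDIMS * NTYPES count comes from.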

TODO:

CI has multiple failures and there are still some bugs (will be fixed shortly).
cuda_memcpy_affine_batch isn't tested separately with this change (will be fixed shortly).
This is a fairly large PR, and I will consider splitting it into a number of smaller changes.

Edited Aug 04, 2023 by apryakhin
