slow non-contiguous CPU-GPU memory transfers
I see a dramatic slowdown in non-contiguous CPU-GPU memory transfers when using CUDA pointers. For example:
using T = char;
array<T, 2, cuda::allocator<T>> Dev({1024,1024});
array<T, 2> Host({1024,1024});
Dev = Host; // this is fine, since the transfer is contiguous
Dev.sliced(0,512) = Host.sliced(0,512); // this is fine, since the transfer is contiguous
Dev({0,512},{0,512}) = Host({0,512},{0,512}); // about 10^4 times slower: the 512x512 sub-block is not contiguous
Edited by Alfredo Correa