Slow non-contiguous CPU-GPU memory transfers

I see a dramatic slowdown in non-contiguous CPU-GPU memory transfers when using CUDA pointers. For example,

using T = char;
array<T, 2, cuda::allocator<T>> Dev({1024,1024});
array<T, 2> Host({1024,1024});

Dev = Host;   // this is fine, since the transfer is contiguous
Dev.sliced(0,512) = Host.sliced(0,512);   // also fine: the first 512 rows are contiguous in memory
Dev({0,512},{0,512}) = Host({0,512},{0,512});   // non-contiguous 512x512 sub-block: about 10^4 times slower!

Edited Aug 08, 2021 by Alfredo Correa
