Revert bit_cast to use memcpy for CUDA.
For the compiler to elide the memcpy, we first need to load the
src value into registers by making a local copy. This avoids
resorting to reinterpret_cast, which is potential UB.
This change doesn't seem to affect CPU builds (at least not with
gcc/clang): with optimizations on, the local copy is elided as well.
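
For reference, a minimal sketch of the pattern described above, not the
actual patch. The names bit_cast_sketch, float_bits, self_check and
SKETCH_DEVICE_FUNC are hypothetical, and the static_asserts are assumed
preconditions rather than something taken from the real code.

```cpp
#include <cassert>
#include <cstdint>
#include <string.h>
#include <type_traits>

// Hypothetical macro: mark the function for host and device when built as CUDA.
#if defined(__CUDACC__)
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC
#endif

// Hypothetical stand-in for bit_cast: copy the object representation of
// src into a value of type Tgt without using reinterpret_cast.
template <typename Tgt, typename Src>
SKETCH_DEVICE_FUNC inline Tgt bit_cast_sketch(const Src& src) {
  static_assert(sizeof(Tgt) == sizeof(Src), "sizes must match");
  static_assert(std::is_trivially_copyable<Src>::value, "Src must be trivially copyable");
  static_assert(std::is_trivially_copyable<Tgt>::value, "Tgt must be trivially copyable");
  static_assert(std::is_default_constructible<Tgt>::value, "Tgt must be default constructible");

  // Local copy: gives the compiler a value it can keep in registers,
  // which is what lets it optimize the memcpy below away.
  const Src staged = src;
  Tgt tgt;
  memcpy(&tgt, &staged, sizeof(Tgt));
  return tgt;
}

// Usage sketch: read the bit pattern of a float.
inline std::uint32_t float_bits(float x) {
  return bit_cast_sketch<std::uint32_t>(x);
}

// Assumes IEEE-754 binary32 floats, where 1.0f is encoded as 0x3F800000.
inline void self_check() {
  assert(float_bits(1.0f) == 0x3F800000u);
  assert(bit_cast_sketch<float>(float_bits(3.5f)) == 3.5f);
}
```

Whether the copy and the memcpy actually disappear depends on the
compiler and optimization level, so inspecting the generated code is the
only way to confirm it for a given toolchain.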