Use reinterpret_cast on GPU for bit_cast.
This seems to be the recommended approach for doing type punning in CUDA. See for example:
- https://stackoverflow.com/questions/47037104/cuda-type-punning-memcpy-vs-ub-union
- https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
(the latter puns a double to an int2). The issue is that under CUDA the memcpy is not elided and ends up
being an expensive operation. We already have similar reinterpret_casts across
the Eigen codebase for GPU (as does TensorFlow).
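
For context, a minimal sketch of what such a helper can look like (the name `bit_cast_sketch` is illustrative only, not the actual Eigen API; the device branch mirrors the double-to-int2 punning from the NVIDIA reduction post):

```cpp
#include <cstring>

// Sketch of a bit_cast-style helper: reinterpret_cast on device,
// memcpy on host where the compiler reliably elides it.
template <typename To, typename From>
__host__ __device__ To bit_cast_sketch(const From& src) {
  static_assert(sizeof(To) == sizeof(From), "To and From must have the same size");
#ifdef __CUDA_ARCH__
  // Device code path: nvcc may not elide the memcpy, so reinterpret the bits directly.
  return *reinterpret_cast<const To*>(&src);
#else
  // Host code path: memcpy is the well-defined way to pun and is optimized away.
  To dst;
  std::memcpy(&dst, &src, sizeof(To));
  return dst;
#endif
}

// Example usage (as in the reduction post): view a double's bits as an int2.
// __device__ int2 as_int2(double d) { return bit_cast_sketch<int2>(d); }
```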