Revert bit_cast to use memcpy for CUDA.

For the compiler to elide the memcpy on CUDA, the src value must first be loaded into registers by making a local copy. This avoids resorting to reinterpret_cast, which is potentially UB.

This change doesn't seem to affect CPU code (at least not with gcc/clang): with optimizations on, the extra local copy is elided there as well.
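For reviewers, here is a minimal sketch of the pattern described above, not the actual patch. The names `bit_cast_sketch`, `SKETCH_DEVICE_FUNC`, and `bit_cast_sketch_selfcheck` are hypothetical and exist only for illustration. The static assertions are an assumption about what such a helper would reasonably check.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical macro: marks the function for host and device when compiled
// with nvcc, and expands to nothing for a plain host compiler.
#if defined(__CUDACC__)
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC
#endif

// Hypothetical helper illustrating the approach; not the library's code.
template <typename Tgt, typename Src>
SKETCH_DEVICE_FUNC inline Tgt bit_cast_sketch(const Src& src) {
  static_assert(std::is_trivially_copyable<Src>::value, "Src must be trivially copyable");
  static_assert(std::is_trivially_copyable<Tgt>::value, "Tgt must be trivially copyable");
  static_assert(std::is_default_constructible<Tgt>::value, "Tgt must be default constructible");
  static_assert(sizeof(Src) == sizeof(Tgt), "Src and Tgt must have the same size");

  Tgt tgt;
  // Local copy of the source value, so the memcpy below reads from a local
  // object rather than through the reference.
  const Src staged = src;
  // memcpy copies the object representation without reinterpret_cast.
  std::memcpy(&tgt, &staged, sizeof(Tgt));
  return tgt;
}

// Small self-check, assuming IEEE 754 single-precision floats.
inline void bit_cast_sketch_selfcheck() {
  assert(bit_cast_sketch<std::uint32_t>(1.0f) == 0x3F800000u);
  assert(bit_cast_sketch<float>(std::uint32_t{0x3F800000u}) == 1.0f);
}
```

Whether the copy and the memcpy are actually elided depends on the compiler and optimization level, so inspecting the generated PTX or assembly is the way to confirm it for a given toolchain.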
