Fix calls to device functions from host code

Removes undefined behavior when building with nvcc, caused by calls to host-only functions from device code. Each case is fixed either by restricting the calling function to the host or by adding a device implementation where appropriate.
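
The two kinds of fix can be illustrated with a minimal sketch. It is not taken from the actual diff: the macro names `HOST_DEVICE` and `HOST_ONLY` and the functions `host_only_scale`, `scaled_on_host` and `scaled_anywhere` are hypothetical, chosen only for illustration. The macros expand to nothing under a plain host compiler, so the snippet builds with or without nvcc.

```cpp
#include <cassert>

// Hypothetical annotation macros (assumption, not the project's own names).
// Under nvcc they map to CUDA execution-space specifiers; otherwise they vanish.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#define HOST_ONLY __host__
#else
#define HOST_DEVICE
#define HOST_ONLY
#endif

// A host-only helper: it carries no __device__ annotation.
inline float host_only_scale(float x) { return 2.0f * x; }

// Problematic pattern (left commented out): a host/device function that
// calls the host-only helper unconditionally, including in the device pass.
// HOST_DEVICE inline float bad(float x) { return host_only_scale(x); }

// Fix 1: restrict the calling function to the host.
HOST_ONLY inline float scaled_on_host(float x) { return host_only_scale(x); }

// Fix 2: keep the caller host/device and give it a device implementation.
// __CUDA_ARCH__ is defined only while nvcc compiles device code.
HOST_DEVICE inline float scaled_anywhere(float x) {
#if defined(__CUDA_ARCH__)
  return 2.0f * x;            // device path: no host-only call
#else
  return host_only_scale(x);  // host path: the helper is callable here
#endif
}

// Host-side sanity check that both fixes agree with the helper.
inline void self_check() {
  assert(scaled_on_host(1.5f) == host_only_scale(1.5f));
  assert(scaled_anywhere(1.5f) == host_only_scale(1.5f));
}
```

Fix 1 suits functions that are never needed on the device. Fix 2 suits functions that device code does use, at the cost of maintaining a second code path.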

What does this implement/fix?

Fixes TensorFlow builds, which can otherwise produce incorrect code when compiled with CUDA 11.3.
