Changing the Eigen::half implementation for HIP

Currently, when compiling with HIP, Eigen::half is derived from the __half_raw struct defined in the hip_fp16.h header file. This is true for both the "host" compile phase and the "device" compile phase. This was causing a very hard-to-detect bug in the ROCm TensorFlow build.

In the ROCm TensorFlow build,

  • files that do not contain any GPU code get compiled via gcc, and
  • files that contain GPU code get compiled via hipcc.

In certain cases, we have a function that is defined in a file compiled by hipcc and called from a file compiled by gcc. If such a function had an Eigen::half as a "pass-by-value" argument, its value was getting corrupted by the time it was received by the function.

The reason for this seems to be that for the gcc compile, Eigen::half is derived from a __half_raw struct that has uint16_t as the data store, while for hipcc the __half_raw implementation uses _Float16 as the data store. There is some ABI incompatibility between gcc and hipcc (which is essentially the latest clang), which results in the Eigen::half value (which is correct at the call site) getting randomly corrupted when passed to the function.

Changing the Eigen::half argument to be "pass-by-reference" seems to work around the error.
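
A minimal sketch of that workaround, using a hypothetical stand-in type `HalfBits` and a hypothetical function `raw_bits_of` (neither is real Eigen or TensorFlow code). The assumption here is that with a const reference only an address crosses the gcc / hipcc boundary, so the by-value calling convention for the struct is not exercised:

```cpp
#include <cstdint>

// Hypothetical stand-in for Eigen::half; only the 16-bit storage matters here.
struct HalfBits {
  std::uint16_t x;
};

// Before (the problematic form across the gcc / hipcc boundary):
//   std::uint16_t raw_bits_of(HalfBits h);
//
// After: the argument is taken by const reference, so the callee reads the
// value from memory instead of from an argument register.
std::uint16_t raw_bits_of(const HalfBits& h) {
  return h.x;
}
```

This only sidesteps the problem at one call site; every other by-value Eigen::half argument that crosses the same boundary would need the same change.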

To fix this so that we do not run into it again in TF, this commit changes the Eigen::half implementation to use the same __half_raw implementation as the non-GPU compile during the host compile phase of the hipcc compile.
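
The idea can be sketched roughly as follows. This is an illustration only, not the actual Eigen source: `half_raw_sketch`, `half_storage_t` and `half_sketch` are hypothetical names, and the use of the `__HIP_DEVICE_COMPILE__` macro to detect the device compile phase is an assumption about how the phases are distinguished.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical names throughout; this mirrors the idea, not Eigen's code.
#if defined(__HIP_DEVICE_COMPILE__)
// Device compile phase of hipcc: keep the native half-precision storage.
using half_storage_t = _Float16;
struct half_raw_sketch {
  half_storage_t data;
};
#else
// Host compile phase of hipcc, and plain gcc: use the same integer storage
// in both, so the by-value argument layout agrees across the two compilers.
using half_storage_t = std::uint16_t;
struct half_raw_sketch {
  half_storage_t x;
};
#endif

// Stand-in for Eigen::half, which derives from the raw struct.
struct half_sketch : public half_raw_sketch {};

static_assert(sizeof(half_sketch) == 2, "half must stay 16 bits wide");
```

With this arrangement, any translation unit built for the host sees the uint16_t-based struct regardless of which compiler produced it, while device code keeps the native type.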


/cc @cantonios @chhtz @rmlarsen1

This TF PR is an example of where we saw this issue (but did not correctly root-cause it):

https://github.com/tensorflow/tensorflow/pull/42897

This (to-be-upstreamed) PR is another example of the same issue:

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/pull/1271
