Remove denormal flushing in FP32ToBF16 for AVX & AVX512

Reference issue

What does this implement/fix?

Flushing denormals inside FP32ToBF16 is expensive and makes BF16 Eigen ops much slower than FP32 on AVX512 and AVX. The flush is not required here, because denormal handling should be configured globally by the code that uses the Eigen library. For example, TensorFlow sets it correctly when creating a new Eigen thread: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/threadpool.cc#L56
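As a rough illustration of what "handled at global level" can look like on x86, the sketch below sets the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in MXCSR once per thread, using the standard SSE macros from `<xmmintrin.h>` and `<pmmintrin.h>`. The helper names `EnableDenormalFlushingForThisThread` and `DenormalFlushingEnabled` are hypothetical; this is not the code used by Eigen or TensorFlow, only a minimal example of the approach, and it assumes an x86 target with SSE.

```cpp
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE, _MM_GET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE

// Hypothetical helper (not part of Eigen): turn on FTZ and DAZ in MXCSR
// for the calling thread. MXCSR is per-thread state, so a thread pool
// would call this once at the start of each worker thread.
inline void EnableDenormalFlushingForThisThread() {
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results -> 0
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs -> 0
}

// Hypothetical helper: report whether both modes are active on this thread.
inline bool DenormalFlushingEnabled() {
  return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON &&
         _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
}
```

With the mode configured this way by the caller, a per-conversion flush inside the packet routine would be redundant work on the hot path, which is the motivation for removing it.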

Additional information

With this change, we have seen a significant performance increase for models run in BF16.
Edited by Gauri Deshpande
