Use fma<float> for fma<half> and fma<bfloat16> if native fma is not available on the platform.
Thanks to @sandwichmaker for pointing out this corner case: if a*b overflows but a*b+c is finite, computing a*b+c as a separate float32 multiply and add overflows in the intermediate product, while fma(a,b,c) does not, because it rounds only once, after the addition.
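A minimal sketch of the corner case in plain float32, not the code from this change. The helper names `unfused_madd` and `fused_madd` are hypothetical and only illustrate the difference between two rounded operations and a single fused one; it assumes the compiler is not run with fast-math style flags.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: multiply, then add, as two separately rounded
// float32 operations. The volatile intermediate forces the product to be
// rounded to float32 and keeps the compiler from contracting the
// expression into a fused multiply-add.
inline float unfused_madd(float a, float b, float c) {
  volatile float prod = a * b;  // may overflow to +inf here
  return prod + c;              // inf + finite stays inf
}

// Hypothetical helper: fused multiply-add from the standard library.
// The exact value of a*b+c is rounded once, so a finite result survives
// even when a*b alone would overflow.
inline float fused_madd(float a, float b, float c) {
  return std::fma(a, b, c);
}
```

For example, with `a = FLT_MAX`, `b = 2`, `c = -FLT_MAX`, the exact result is `FLT_MAX`: the unfused version returns infinity, while the fused version returns `FLT_MAX`.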
Edited by Rasmus Munk Larsen