Add generic fast psqrt and prsqrt impls and make them correct for 0, +Inf, NaN, and negative arguments.
- Consolidate fast psqrt and prsqrt into generic implementations to avoid duplicating this code for SSE, AVX, and AVX512. TODO: Use these generic implementations for more architectures.
- Make both fast psqrt and prsqrt correct for 0, +Inf, NaN, and negative arguments. These functions are now fully standard-compliant, except that they treat positive subnormal input arguments as zero.
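
To illustrate why the special cases need explicit handling, here is a minimal scalar sketch. It is not the code in this change: the names `approx_rsqrt`, `sketch_rsqrt`, and `sketch_sqrt` are hypothetical, the bit-trick estimate stands in for a hardware reciprocal-sqrt instruction, and the sketch assumes refinement by Newton-Raphson steps. It also ignores the sign of zero. A vectorized version would presumably use mask-and-select operations instead of branches.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

namespace {
const float kInf = std::numeric_limits<float>::infinity();
const float kNaN = std::numeric_limits<float>::quiet_NaN();
const float kMinNormal = std::numeric_limits<float>::min();

// Hypothetical stand-in for a low-precision hardware rsqrt estimate.
// Zero and positive subnormals are both treated as zero.
inline float approx_rsqrt(float x) {
  if (std::isnan(x) || x < 0.0f) return kNaN;
  if (x < kMinNormal) return kInf;
  if (std::isinf(x)) return 0.0f;
  std::uint32_t i;
  std::memcpy(&i, &x, sizeof(i));
  i = 0x5f3759dfu - (i >> 1);  // classic bit-level initial guess
  float y;
  std::memcpy(&y, &i, sizeof(y));
  return y;
}

// Reciprocal square root: estimate, refine, then patch the special cases.
inline float sketch_rsqrt(float x) {
  const float y0 = approx_rsqrt(x);
  float y = y0;
  for (int k = 0; k < 2; ++k) {
    y = y * (1.5f - 0.5f * x * y * y);  // Newton-Raphson step
  }
  // For x == 0 the estimate is +Inf and the step computes 0 * Inf = NaN;
  // for x == +Inf the estimate is 0 and the step computes Inf * 0 = NaN.
  // In those cases (and for negative x) the raw estimate is already right.
  const bool special = (x < kMinNormal) || (x == kInf);
  return special ? y0 : y;
}

// Square root as x * rsqrt(x), which would also give NaN for 0 and +Inf
// unless those inputs are handled separately.
inline float sketch_sqrt(float x) {
  if (x >= 0.0f && x < kMinNormal) return 0.0f;  // zero and subnormals
  if (x == kInf) return x;
  return x * sketch_rsqrt(x);  // negative or NaN input propagates NaN
}
}  // namespace
```

Without the final selection in `sketch_rsqrt`, inputs of 0 and +Inf would both produce NaN, which is the kind of incorrect result the second bullet refers to.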
The performance regressions associated with these changes are less than 5%, as measured for SSE+FMA, AVX, and AVX512 on Skylake.
Edited by Rasmus Munk Larsen