Modified sqrt/rsqrt for denormal handling.
This updates the new generic sqrt/rsqrt implementation after !868 (merged) to account for the following:
- Better handling of `std::numeric_limits<T>::denorm_min()` (the original incorrectly returns `NaN` for AVX512)
- Better handling of denormals in general (will often give correct answers rather than flushing to 0/`inf`)
- Faster `sqrt` and `rsqrt` for AVX512 (but slightly slower `rsqrt` for SSE; AVX is unchanged)
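To illustrate the kind of denormal handling described in the list above, here is a minimal scalar sketch. It is not the code from this change: the helper names `rsqrt_rescaled` and `sqrt_rescaled` are hypothetical, and the rescaling trick (multiply a subnormal input by an even power of two, then undo the scaling on the result) is just one common approach, assumed here for illustration. The actual vectorized implementation may differ.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helpers (not Eigen API): keep sqrt/rsqrt accurate for
// subnormal float inputs by rescaling them into the normal range first.

inline float rsqrt_rescaled(float x) {
  const float min_normal = std::numeric_limits<float>::min();
  if (x > 0.0f && x < min_normal) {
    // x * 2^48 is exact and normal, even for denorm_min() == 2^-149.
    const float scaled = std::ldexp(x, 48);
    // 1/sqrt(x * 2^48) == 2^-24 / sqrt(x), so multiply the result by 2^24.
    return std::ldexp(1.0f / std::sqrt(scaled), 24);
  }
  return 1.0f / std::sqrt(x);
}

inline float sqrt_rescaled(float x) {
  const float min_normal = std::numeric_limits<float>::min();
  if (x > 0.0f && x < min_normal) {
    // sqrt(x * 2^48) == 2^24 * sqrt(x), so multiply the result by 2^-24.
    return std::ldexp(std::sqrt(std::ldexp(x, 48)), -24);
  }
  return std::sqrt(x);
}
```

With this sketch, `rsqrt_rescaled(std::numeric_limits<float>::denorm_min())` yields a finite positive value instead of `inf` or `NaN`, and `sqrt_rescaled` of the same input yields a nonzero result instead of flushing to 0. An even scaling exponent is chosen so that the square root of the scale factor is itself an exact power of two.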
Google benchmark numbers (only significant changes shown):
Comparing ./sqrt_old_sse4.2 to ./sqrt_new_sse4.2
Benchmark                     Time       CPU  Time Old  Time New   CPU Old   CPU New
------------------------------------------------------------------------------------
BM_Rsqrt<float>/8/1        +0.1165   +0.1165         5         5         5         5
BM_Rsqrt<float>/64/1       +0.1355   +0.1355        25        28        25        28
BM_Rsqrt<float>/512/1      +0.1340   +0.1340       195       221       195       221
BM_Rsqrt<float>/2048/1     +0.0715   +0.0714      1016      1089      1016      1089
Comparing ./sqrt_old_avx512dq to ./sqrt_new_avx512dq
Benchmark                     Time       CPU  Time Old  Time New   CPU Old   CPU New
------------------------------------------------------------------------------------
BM_Sqrt<float>/8/1         -0.0226   -0.0226         9         8         9         8
BM_Sqrt<float>/64/1        -0.3050   -0.3050        14         9        14         9
BM_Sqrt<float>/512/1       -0.3282   -0.3282       104        70       104        70
BM_Sqrt<float>/2048/1      -0.2790   -0.2790       469       338       469       338
BM_Sqrt<double>/8/1        -0.1990   -0.1990         5         4         5         4
BM_Sqrt<double>/64/1       -0.2366   -0.2366        34        26        34        26
BM_Sqrt<double>/512/1      -0.2236   -0.2236       313       243       313       243
BM_Sqrt<double>/2048/1     -0.2237   -0.2237      1287       999      1287       999
BM_Rsqrt<float>/8/1        +0.0166   +0.0165         5         5         5         5
BM_Rsqrt<float>/64/1       -0.0715   -0.0715        11        10        11        10
BM_Rsqrt<float>/512/1      -0.1097   -0.1097        82        73        82        73
BM_Rsqrt<float>/2048/1     -0.1323   -0.1323       387       335       387       335
BM_Rsqrt<double>/8/1       -0.0874   -0.0874         5         5         5         5
BM_Rsqrt<double>/64/1      -0.1198   -0.1198        31        27        31        27
BM_Rsqrt<double>/512/1     -0.1499   -0.1499       287       244       287       244
BM_Rsqrt<double>/2048/1    -0.1728   -0.1727      1181       977      1181       977
OVERALL_GEOMEAN            -0.1616   -0.1616         0         0         0         0
Edited by Antonio Sánchez