Make preciprocal IEEE compliant w.r.t. 1/0 and 1/inf.
The new implementation takes advantage of the precondition that the starting approximation is 0 for infinite arguments and vice versa. Since one term in the Newton-Raphson step is the product of the argument and the approximation, we can detect zeros and infinities by checking if this term is NaN. This is faster than explicitly testing whether the argument is inf or 0.
Here are benchmark results comparing this implementation with pdiv (i.e. the state before commit ea2c0206.) The change also appears to speed up the scalar path significantly (not sure why, but I'll take it).
SSE+FMA (-mfma)
name old cpu/op new cpu/op delta
BM_eigen_inverse_float/1 2.72ns ± 0% 1.09ns ± 0% -59.73% (p=0.000 n=53+58)
BM_eigen_inverse_float/8 5.71ns ± 1% 5.70ns ± 0% -0.20% (p=0.000 n=49+54)
BM_eigen_inverse_float/64 19.0ns ± 0% 10.4ns ± 3% -45.25% (p=0.000 n=52+60)
BM_eigen_inverse_float/512 95.0ns ± 0% 63.8ns ± 3% -32.85% (p=0.000 n=55+59)
BM_eigen_inverse_float/4k 703ns ± 0% 508ns ± 3% -27.80% (p=0.000 n=55+60)
BM_eigen_inverse_float/32k 5.57µs ± 0% 4.89µs ± 3% -12.30% (p=0.000 n=56+60)
BM_eigen_inverse_float/256k 78.3µs ± 1% 80.4µs ± 2% +2.62% (p=0.000 n=60+60)
BM_eigen_inverse_float/1M 313µs ± 2% 322µs ± 3% +2.82% (p=0.000 n=35+35)
AVX+FMA (-march=skylake)
name old cpu/op new cpu/op delta
BM_eigen_inverse_float/1 2.72ns ± 0% 1.10ns ± 0% -59.68% (p=0.000 n=47+60)
BM_eigen_inverse_float/8 5.72ns ± 0% 5.70ns ± 0% -0.39% (p=0.000 n=52+55)
BM_eigen_inverse_float/64 19.0ns ± 0% 11.0ns ± 2% -41.86% (p=0.000 n=52+60)
BM_eigen_inverse_float/512 95.1ns ± 0% 64.2ns ± 3% -32.44% (p=0.000 n=54+60)
BM_eigen_inverse_float/4k 704ns ± 0% 510ns ± 3% -27.53% (p=0.000 n=54+60)
BM_eigen_inverse_float/32k 5.57µs ± 0% 4.89µs ± 3% -12.32% (p=0.000 n=56+60)
BM_eigen_inverse_float/256k 78.4µs ± 1% 80.6µs ± 1% +2.88% (p=0.000 n=60+53)
BM_eigen_inverse_float/1M 314µs ± 1% 322µs ± 1% +2.79% (p=0.000 n=33+29)
AVX512+FMA (-march=skylake-avx512)
name old cpu/op new cpu/op delta
BM_eigen_inverse_float/1 2.72ns ± 0% 0.62ns ± 2% -77.23% (p=0.000 n=50+60)
BM_eigen_inverse_float/8 5.73ns ± 1% 6.28ns ± 2% +9.65% (p=0.000 n=55+59)
BM_eigen_inverse_float/64 31.4ns ± 2% 13.5ns ± 2% -57.17% (p=0.000 n=50+60)
BM_eigen_inverse_float/512 115ns ± 2% 46ns ± 2% -59.80% (p=0.000 n=60+59)
BM_eigen_inverse_float/4k 787ns ± 2% 324ns ± 3% -58.87% (p=0.000 n=60+49)
BM_eigen_inverse_float/32k 6.17µs ± 2% 3.20µs ± 3% -48.16% (p=0.000 n=60+50)
BM_eigen_inverse_float/256k 80.5µs ± 2% 80.6µs ± 1% ~ (p=0.262 n=57+60)
BM_eigen_inverse_float/1M 322µs ± 1% 322µs ± 1% ~ (p=0.429 n=30+29)
Edited by Rasmus Munk Larsen