Simplify and speed up pow() by 5-6%
Since !1715 improved the accuracy of generic_exp2, we no longer need the complex and slow version previously used for pow. Because we know the magnitudes of the factors in the final product, we can also use pldexp_fast in most cases. The accuracy for float is still within 2 ULPs for non-integer exponents.
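To illustrate the idea for a scalar float, here is a minimal sketch in standard C++. It is not Eigen's packet code: `pow_sketch`, `ldexp_fast_sketch`, and `ulp_distance` are hypothetical names, and the range checks are assumptions chosen for this example. It computes pow(x, y) as 2^(y*log2(x)), splits the exponent into an integer part and a small fraction, and applies the integer part with a cheap exponent-field add whenever the result is known to stay in the normal range.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: scale x by 2^e by adding e to the biased exponent field.
// Valid only if x is a normal, finite, nonzero float and x * 2^e is also normal.
inline float ldexp_fast_sketch(float x, int e) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);
  // Unsigned arithmetic wraps, so a negative e subtracts from the exponent field.
  bits += static_cast<std::uint32_t>(e) << 23;
  float r;
  std::memcpy(&r, &bits, sizeof r);
  return r;
}

// Hypothetical scalar sketch of pow for finite x > 0 and finite y.
inline float pow_sketch(float x, float y) {
  assert(x > 0.0f && std::isfinite(x) && std::isfinite(y));
  // t = y * log2(x), evaluated in double so the product keeps extra precision.
  const double t = static_cast<double>(y) * std::log2(static_cast<double>(x));
  const double n = std::nearbyint(t);          // integer part of the exponent
  const float f = static_cast<float>(t - n);   // fraction, |f| <= 0.5
  const float m = std::exp2(f);                // m lies in [2^-0.5, 2^0.5]
  const int e = static_cast<int>(n);
  // m has unbiased exponent -1 or 0, so for e in [-125, 126] the result is
  // guaranteed normal and the fast path is safe.
  if (e >= -125 && e <= 126) return ldexp_fast_sketch(m, e);
  // Otherwise fall back to the general routine (handles subnormals/overflow).
  return std::ldexp(m, e);
}

// Distance in units in the last place between two positive finite floats.
inline std::int64_t ulp_distance(float a, float b) {
  std::uint32_t ia, ib;
  std::memcpy(&ia, &a, sizeof ia);
  std::memcpy(&ib, &b, sizeof ib);
  const std::int64_t d = static_cast<std::int64_t>(ia) - static_cast<std::int64_t>(ib);
  return d < 0 ? -d : d;
}
```

Comparing `pow_sketch(x, 0.25f)` against `std::pow` evaluated in double and rounded to float, using `ulp_distance`, gives a simple way to spot-check an error bound like the one quoted above.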
Measured speedup of a.array().pow(0.25), which invokes the generic path:
AVX2+FMA:
name                             old cpu/op    new cpu/op    delta
BM_eigen_powquarter_float/8      95.4ns ± 0%   94.9ns ± 0%   -0.52%  (p=0.000 n=52+56)
BM_eigen_powquarter_float/64     949ns ± 4%    892ns ± 3%    -5.98%  (p=0.000 n=60+59)
BM_eigen_powquarter_float/512    7.66µs ± 4%   7.23µs ± 5%   -5.61%  (p=0.000 n=59+59)
BM_eigen_powquarter_float/4k     61.5µs ± 3%   57.7µs ± 3%   -6.14%  (p=0.000 n=60+54)
BM_eigen_powquarter_float/32k    491µs ± 3%    464µs ± 4%    -5.57%  (p=0.000 n=60+58)
BM_eigen_powquarter_float/256k   3.93ms ± 4%   3.73ms ± 4%   -5.16%  (p=0.000 n=49+49)
BM_eigen_powquarter_float/1M     15.7ms ± 5%   15.0ms ± 5%   -4.94%  (p=0.000 n=59+59)
BM_eigen_powquarter_double/8     350ns ± 1%    344ns ± 0%    -1.69%  (p=0.000 n=39+46)
BM_eigen_powquarter_double/64    2.82µs ± 4%   2.69µs ± 6%   -4.65%  (p=0.000 n=54+55)
BM_eigen_powquarter_double/512   22.3µs ± 4%   21.6µs ± 5%   -3.33%  (p=0.000 n=60+53)
BM_eigen_powquarter_double/4k    178µs ± 4%    172µs ± 6%    -3.71%  (p=0.000 n=59+55)
BM_eigen_powquarter_double/32k   1.42ms ± 4%   1.37ms ± 5%   -3.43%  (p=0.000 n=60+52)
BM_eigen_powquarter_double/256k  11.4ms ± 5%   10.9ms ± 6%   -4.05%  (p=0.000 n=60+59)
BM_eigen_powquarter_double/1M    45.7ms ± 4%   43.7ms ± 6%   -4.40%  (p=0.000 n=60+58)
Edited by Rasmus Munk Larsen