Simplify and speed up pow() by 5-6% (!1754)

Since !1715 improved the accuracy of generic_exp2, we no longer need the complex and slow version previously used for pow. We can also take advantage of what we know about the magnitudes of the factors in the final product to use pldexp_fast in most cases. Accuracy for float remains within 2 ULPs for non-integer exponents.
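To illustrate the idea, here is a minimal scalar sketch in standard C++, not Eigen's actual code. It assumes pow(x, y) is evaluated as 2^(y * log2(x)) with the exponent split into an integer part n and a fraction f, so the final product is exp2(f) * 2^n. Because exp2(f) is known to lie in a narrow range around 1, the scaling by 2^n can be done by adding n to the exponent bits. The names `ldexp_fast_sketch` and `pow_sketch` are hypothetical, and the sketch ignores the extra-precision handling of y * log2(x) that a 2-ULP implementation needs.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Eigen's pldexp_fast): scale a by 2^e by adding e
// to the biased exponent field. Only valid when a is a normal, finite float
// and the result is also a normal float. No handling of overflow, underflow,
// zero, infinity, or NaN.
inline float ldexp_fast_sketch(float a, int e) {
  std::uint32_t bits;
  std::memcpy(&bits, &a, sizeof(bits));
  // The float exponent field starts at bit 23. Unsigned wraparound makes
  // this work for negative e as well.
  bits += static_cast<std::uint32_t>(e) << 23;
  float r;
  std::memcpy(&r, &bits, sizeof(r));
  return r;
}

// Hypothetical scalar sketch of pow(x, y) = 2^(y * log2(x)) for x > 0,
// valid only while the result stays in the normal float range.
inline float pow_sketch(float x, float y) {
  const float t = y * std::log2(x);   // exponent in base 2 (plain float precision)
  const float n = std::nearbyint(t);  // integer part of the exponent
  const float f = t - n;              // fraction, in [-0.5, 0.5] under default rounding
  const float m = std::exp2(f);       // roughly in [0.707, 1.414], always normal
  // Magnitude of m is known, so the cheap exponent-bit scaling is safe here.
  return ldexp_fast_sketch(m, static_cast<int>(n));
}
```

For inputs whose result could overflow or become subnormal, a full ldexp with range handling would still be needed, which is presumably why the fast path applies only in most cases rather than all.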

Measured speedup of a.array().pow(0.25), which invokes the generic path:

AVX2+FMA:

name                             old cpu/op   new cpu/op   delta
BM_eigen_powquarter_float/8      95.4ns ± 0%  94.9ns ± 0%  -0.52%  (p=0.000 n=52+56)
BM_eigen_powquarter_float/64      949ns ± 4%   892ns ± 3%  -5.98%  (p=0.000 n=60+59)
BM_eigen_powquarter_float/512    7.66µs ± 4%  7.23µs ± 5%  -5.61%  (p=0.000 n=59+59)
BM_eigen_powquarter_float/4k     61.5µs ± 3%  57.7µs ± 3%  -6.14%  (p=0.000 n=60+54)
BM_eigen_powquarter_float/32k     491µs ± 3%   464µs ± 4%  -5.57%  (p=0.000 n=60+58)
BM_eigen_powquarter_float/256k   3.93ms ± 4%  3.73ms ± 4%  -5.16%  (p=0.000 n=49+49)
BM_eigen_powquarter_float/1M     15.7ms ± 5%  15.0ms ± 5%  -4.94%  (p=0.000 n=59+59)
BM_eigen_powquarter_double/8      350ns ± 1%   344ns ± 0%  -1.69%  (p=0.000 n=39+46)
BM_eigen_powquarter_double/64    2.82µs ± 4%  2.69µs ± 6%  -4.65%  (p=0.000 n=54+55)
BM_eigen_powquarter_double/512   22.3µs ± 4%  21.6µs ± 5%  -3.33%  (p=0.000 n=60+53)
BM_eigen_powquarter_double/4k     178µs ± 4%   172µs ± 6%  -3.71%  (p=0.000 n=59+55)
BM_eigen_powquarter_double/32k   1.42ms ± 4%  1.37ms ± 5%  -3.43%  (p=0.000 n=60+52)
BM_eigen_powquarter_double/256k  11.4ms ± 5%  10.9ms ± 6%  -4.05%  (p=0.000 n=60+59)
BM_eigen_powquarter_double/1M    45.7ms ± 4%  43.7ms ± 6%  -4.40%  (p=0.000 n=60+58)

Edited Nov 20, 2024 by Rasmus Munk Larsen
