Improve pow(x,y): 25% speedup, increase accuracy for integer exponents.
This change optimizes the extended-accuracy log2() operator used by pow<float>(x,y) in Eigen. In the benchmarks below, this speeds up pow<float>(x,y) by about 15% to 28% for arrays of 64 or more elements, depending on the instruction set, while keeping the maximum absolute error below 3 ulps. A future change will speed up the function for double, if possible.
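For readers unfamiliar with the ulp metric used above, here is a minimal sketch of one way to measure the error of a float result against a double-precision reference. The names ulp_error and max_pow_ulp_error are hypothetical and not part of Eigen, the scan calls std::pow rather than Eigen's packet implementation, and the helper assumes finite inputs that do not overflow float. It is an illustration only, not the test used for this change.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of Eigen): error of a float result, measured
// in ulps of the float-rounded reference value. Assumes finite inputs.
inline double ulp_error(float approx, double exact) {
  // Round the reference to float and take its magnitude.
  const float ref = std::fabs(static_cast<float>(exact));
  // Spacing between ref and the next larger float is one ulp at ref.
  const float next = std::nextafter(ref, std::numeric_limits<float>::infinity());
  const double ulp = static_cast<double>(next) - static_cast<double>(ref);
  return std::fabs(static_cast<double>(approx) - exact) / ulp;
}

// Hypothetical scan: worst observed ulp error of the float overload of
// std::pow over `samples` evenly spaced points x in (0.5, 2.0].
inline double max_pow_ulp_error(float y, int samples) {
  double worst = 0.0;
  for (int i = 1; i <= samples; ++i) {
    const float x = 0.5f + 1.5f * static_cast<float>(i) / static_cast<float>(samples);
    const float got = std::pow(x, y);  // float overload
    const double want = std::pow(static_cast<double>(x), static_cast<double>(y));
    worst = std::max(worst, ulp_error(got, want));
  }
  return worst;
}
```

Calling max_pow_ulp_error(0.25f, 100000) would report the worst error seen for the standard library's pow on that grid. Checking an Eigen build would mean swapping in the Eigen array pow call for std::pow.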
This change also partly removes the specialization for integer exponents, which could produce very large errors when the exponent was a large integer. For integer exponents in [-3,7], the error of the simpler recursive doubling algorithm is less than 3 ulps, matching the generic implementation. Because recursive doubling is about 30x faster than the generic algorithm, it is retained for this common case of small integer exponents.
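As an illustration of the recursive doubling (repeated squaring) idea for integer exponents, here is a minimal scalar sketch. The function name pow_int_by_squaring is hypothetical, and this is not Eigen's internal, vectorized implementation. Each multiplication rounds, so rounding error accumulates as the exponent grows, which is consistent with restricting this path to small exponents.

```cpp
// Hypothetical scalar sketch of recursive doubling; not Eigen's actual code.
inline float pow_int_by_squaring(float x, int n) {
  // Take the magnitude of n in unsigned arithmetic so INT_MIN is handled safely.
  unsigned int m = n < 0 ? 0u - static_cast<unsigned int>(n)
                         : static_cast<unsigned int>(n);
  float result = 1.0f;
  float base = x;  // holds x^(2^k) at step k
  while (m != 0u) {
    if (m & 1u) result *= base;  // this bit of the exponent is set
    m >>= 1;
    if (m != 0u) base *= base;   // square only if more bits remain
  }
  // Negative exponents use the reciprocal of the positive power.
  return n < 0 ? 1.0f / result : result;
}
```

For example, pow_int_by_squaring(2.0f, 7) uses two squarings and three multiplications into the result, rather than a log2/exp2 evaluation.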
Measured speedup for non-integer exponents:
SSE 4.2:
name                            old cpu/op   new cpu/op   delta
BM_eigen_powquarter_float/1     3.39ns ± 1%  3.38ns ± 1%   -0.57%  (p=0.000 n=46+48)
BM_eigen_powquarter_float/8     95.0ns ± 0%  95.2ns ± 0%   +0.16%  (p=0.000 n=54+53)
BM_eigen_powquarter_float/64    1.17µs ± 2%  0.96µs ± 6%  -17.63%  (p=0.000 n=53+57)
BM_eigen_powquarter_float/512   9.72µs ± 3%  7.81µs ± 6%  -19.62%  (p=0.000 n=56+50)
BM_eigen_powquarter_float/4k    78.2µs ± 3%  63.5µs ± 7%  -18.73%  (p=0.000 n=55+58)
BM_eigen_powquarter_float/32k    624µs ± 3%   499µs ± 4%  -20.06%  (p=0.000 n=57+48)
BM_eigen_powquarter_float/256k  5.00ms ± 2%  4.05ms ± 7%  -18.96%  (p=0.000 n=57+45)
BM_eigen_powquarter_float/1M    20.1ms ± 2%  16.1ms ± 7%  -19.97%  (p=0.000 n=56+52)
AVX2+FMA:
name                            old cpu/op   new cpu/op   delta
BM_eigen_powquarter_float/1     3.39ns ± 1%  3.54ns ± 0%   +4.63%  (p=0.000 n=44+49)
BM_eigen_powquarter_float/8     94.9ns ± 0%  94.8ns ± 0%   -0.06%  (p=0.007 n=52+52)
BM_eigen_powquarter_float/64     880ns ± 7%   671ns ± 3%  -23.70%  (p=0.000 n=57+57)
BM_eigen_powquarter_float/512   7.13µs ± 8%  5.25µs ± 3%  -26.37%  (p=0.000 n=60+55)
BM_eigen_powquarter_float/4k    57.0µs ± 6%  42.0µs ± 3%  -26.33%  (p=0.000 n=58+53)
BM_eigen_powquarter_float/32k    462µs ± 8%   334µs ± 5%  -27.66%  (p=0.000 n=60+48)
BM_eigen_powquarter_float/256k  3.68ms ± 8%  2.67ms ± 3%  -27.45%  (p=0.000 n=50+49)
BM_eigen_powquarter_float/1M    14.7ms ± 5%  10.7ms ± 3%  -27.51%  (p=0.000 n=60+55)
AVX512:
name                            old cpu/op   new cpu/op   delta
BM_eigen_powquarter_float/1     3.38ns ± 1%  3.27ns ± 0%   -3.29%  (p=0.000 n=45+48)
BM_eigen_powquarter_float/8     95.0ns ± 1%  95.5ns ± 1%   +0.54%  (p=0.000 n=56+56)
BM_eigen_powquarter_float/64     581ns ± 3%   492ns ± 3%  -15.35%  (p=0.000 n=57+60)
BM_eigen_powquarter_float/512   3.85µs ± 3%  2.99µs ± 2%  -22.34%  (p=0.000 n=49+49)
BM_eigen_powquarter_float/4k    29.9µs ± 4%  23.0µs ± 3%  -23.07%  (p=0.000 n=50+57)
BM_eigen_powquarter_float/32k    238µs ± 4%   182µs ± 3%  -23.36%  (p=0.000 n=52+57)
BM_eigen_powquarter_float/256k  1.90ms ± 4%  1.46ms ± 3%  -23.45%  (p=0.000 n=58+58)
BM_eigen_powquarter_float/1M    7.65ms ± 4%  5.83ms ± 3%  -23.74%  (p=0.000 n=58+56)
Edited by Rasmus Munk Larsen