Speed up exp(x) by 30-35%
The implementations of pexp only ever pass pldexp arguments from a known, limited range. By exploiting this, we can call a much faster version of pldexp in most cases where the result is not subnormal.
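To illustrate why a restricted exponent range permits a cheaper path, here is a minimal scalar sketch. It is not Eigen's packet implementation: the name `fast_ldexp_sketch` and the stated precondition on `e` are assumptions made for this example. When the exponent is known to lie in [-1022, 1023], the factor 2^e is itself a normal double, so it can be assembled directly from its bit pattern and applied with a single multiply.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not an Eigen API): computes a * 2^e.
// Assumed precondition: -1022 <= e <= 1023, so that 2^e is a normal double.
inline double fast_ldexp_sketch(double a, int e) {
  // Place the biased exponent (e + 1023) in bits 52..62; the mantissa is zero.
  const std::uint64_t bits = static_cast<std::uint64_t>(e + 1023) << 52;
  double scale;
  std::memcpy(&scale, &bits, sizeof(scale));  // scale == 2^e
  return a * scale;                           // one multiply, no branches
}
```

A general-purpose ldexp also has to handle exponents for which 2^e is not representable as a single normal value, which is the extra work a range-restricted caller can skip.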
This change also adjusts the clamping of arguments for pexp<double>(x) so that subnormal results are no longer needlessly flushed to zero. The relative accuracy of subnormal results degrades gracefully from ~10^-16 for normalized results to ~10^-12 for the smallest subnormal results, which is acceptable.
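The effect of the lower clamp can be sketched with the standard library alone. The helpers `exp_clamped_below` and `relative_spacing` and the two bounds used in the usage note are illustrative assumptions, not the constants used by pexp. The sketch shows that a lower bound at log(DBL_MIN) discards results that a wider bound would return as nonzero subnormals, and that the spacing of representable values is coarser, in relative terms, in the subnormal range.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: return 0 for arguments below `lower_bound`,
// otherwise defer to std::exp.
inline double exp_clamped_below(double x, double lower_bound) {
  return x < lower_bound ? 0.0 : std::exp(x);
}

// Relative gap between v and the next representable double above it (v > 0).
inline double relative_spacing(double v) {
  const double next = std::nextafter(v, std::numeric_limits<double>::infinity());
  return (next - v) / v;
}
```

For example, with `x = -720.0`, a bound of `std::log(std::numeric_limits<double>::min())` flushes the result to zero, while a bound of `std::log(std::numeric_limits<double>::denorm_min())` yields a nonzero subnormal value whose `relative_spacing` is larger than that of 1.0. This assumes the program is built without flush-to-zero or fast-math modes.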
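We measure a significant speedup, up to about 40% on large inputs (CPU time per call before and after the change, by instruction set):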
SSE 4.2:
name                      old cpu/op   new cpu/op   delta
BM_eigen_exp_double/1     3.54ns ± 0%  2.55ns ± 1%  -28.03%  (p=0.000 n=46+49)
BM_eigen_exp_double/8     55.6ns ± 1%  54.4ns ± 1%   -2.14%  (p=0.000 n=55+46)
BM_eigen_exp_double/64     256ns ± 8%   194ns ± 7%  -24.36%  (p=0.000 n=54+56)
BM_eigen_exp_double/512   1.82µs ± 7%  1.25µs ± 6%  -31.44%  (p=0.000 n=51+58)
BM_eigen_exp_double/4k    14.3µs ± 7%   9.7µs ± 9%  -31.96%  (p=0.000 n=57+60)
BM_eigen_exp_double/32k    113µs ± 5%    78µs ± 7%  -31.53%  (p=0.000 n=52+60)
BM_eigen_exp_double/256k   909µs ± 5%   620µs ± 8%  -31.86%  (p=0.000 n=53+60)
BM_eigen_exp_double/1M    3.73ms ± 9%  2.51ms ± 7%  -32.91%  (p=0.000 n=46+53)
BM_eigen_exp_float/1      2.46ns ± 1%  1.92ns ± 1%  -21.93%  (p=0.000 n=49+56)
BM_eigen_exp_float/8      37.3ns ±12%  36.6ns ± 1%   -2.03%  (p=0.000 n=40+43)
BM_eigen_exp_float/64      116ns ±10%    91ns ± 5%  -21.55%  (p=0.000 n=58+59)
BM_eigen_exp_float/512     701ns ± 4%   505ns ± 5%  -27.96%  (p=0.000 n=51+59)
BM_eigen_exp_float/4k     5.37µs ± 4%  3.80µs ± 4%  -29.24%  (p=0.000 n=51+50)
BM_eigen_exp_float/32k    43.1µs ± 7%  30.1µs ± 4%  -30.23%  (p=0.000 n=54+49)
BM_eigen_exp_float/256k    345µs ± 6%   242µs ± 5%  -29.86%  (p=0.000 n=47+54)
BM_eigen_exp_float/1M     1.38ms ± 7%  0.97ms ± 5%  -30.10%  (p=0.000 n=57+59)
AVX2+FMA:
name                      old cpu/op   new cpu/op   delta
BM_eigen_exp_double/1     3.00ns ± 1%  2.73ns ± 0%   -9.05%  (p=0.000 n=52+51)
BM_eigen_exp_double/8     53.3ns ± 3%  52.2ns ± 2%   -2.00%  (p=0.000 n=50+55)
BM_eigen_exp_double/64     215ns ± 5%   164ns ± 5%  -23.91%  (p=0.000 n=58+59)
BM_eigen_exp_double/512   1.48µs ± 9%  1.03µs ± 4%  -30.84%  (p=0.000 n=60+55)
BM_eigen_exp_double/4k    11.4µs ± 7%   7.9µs ± 3%  -30.95%  (p=0.000 n=59+57)
BM_eigen_exp_double/32k   91.5µs ± 7%  63.2µs ± 6%  -30.96%  (p=0.000 n=60+58)
BM_eigen_exp_double/256k   732µs ± 6%   507µs ± 6%  -30.73%  (p=0.000 n=59+58)
BM_eigen_exp_double/1M    2.98ms ± 6%  2.04ms ± 6%  -31.78%  (p=0.000 n=49+53)
BM_eigen_exp_float/1      2.18ns ± 0%  2.45ns ± 0%  +12.40%  (p=0.000 n=56+54)
BM_eigen_exp_float/8      29.2ns ± 1%  28.6ns ± 0%   -1.85%  (p=0.000 n=47+48)
BM_eigen_exp_float/64     84.9ns ± 3%  63.4ns ± 5%  -25.28%  (p=0.000 n=56+59)
BM_eigen_exp_float/512     523ns ± 3%   325ns ± 6%  -37.73%  (p=0.000 n=56+47)
BM_eigen_exp_float/4k     4.06µs ± 4%  2.43µs ± 4%  -40.19%  (p=0.000 n=48+50)
BM_eigen_exp_float/32k    32.1µs ± 5%  19.2µs ± 4%  -40.21%  (p=0.000 n=50+49)
BM_eigen_exp_float/256k    258µs ± 4%   155µs ± 7%  -39.73%  (p=0.000 n=51+59)
BM_eigen_exp_float/1M     1.03ms ± 4%  0.62ms ± 7%  -39.97%  (p=0.000 n=57+57)
AVX512:
name                      old cpu/op   new cpu/op   delta
BM_eigen_exp_double/1     3.26ns ± 4%  2.73ns ± 1%  -16.28%  (p=0.000 n=50+50)
BM_eigen_exp_double/8      108ns ± 5%   101ns ± 6%   -6.60%  (p=0.000 n=60+48)
BM_eigen_exp_double/64     199ns ± 6%   170ns ± 5%  -14.53%  (p=0.000 n=60+59)
BM_eigen_exp_double/512    857ns ± 7%   578ns ± 7%  -32.51%  (p=0.000 n=58+57)
BM_eigen_exp_double/4k    6.13µs ± 8%  3.89µs ± 9%  -36.46%  (p=0.000 n=56+50)
BM_eigen_exp_double/32k   48.4µs ± 9%  30.8µs ± 9%  -36.46%  (p=0.000 n=58+51)
BM_eigen_exp_double/256k   388µs ± 9%   254µs ± 6%  -34.45%  (p=0.000 n=49+50)
BM_eigen_exp_double/1M    1.58ms ±12%  1.03ms ± 9%  -34.52%  (p=0.000 n=57+59)
BM_eigen_exp_float/1      3.56ns ± 4%  3.42ns ± 5%   -3.89%  (p=0.000 n=49+50)
BM_eigen_exp_float/8      32.1ns ± 5%  32.2ns ± 5%     ~     (p=0.406 n=51+49)
BM_eigen_exp_float/64     90.2ns ± 5%  83.4ns ± 7%   -7.62%  (p=0.000 n=59+58)
BM_eigen_exp_float/512     279ns ± 4%   214ns ± 5%  -23.45%  (p=0.000 n=51+55)
BM_eigen_exp_float/4k     1.80µs ± 5%  1.31µs ± 5%  -27.37%  (p=0.000 n=56+57)
BM_eigen_exp_float/32k    13.9µs ± 4%  10.5µs ± 6%  -24.76%  (p=0.000 n=56+56)
BM_eigen_exp_float/256k    120µs ± 6%   101µs ± 5%  -16.20%  (p=0.000 n=58+56)
BM_eigen_exp_float/1M      482µs ± 5%   406µs ± 6%  -15.75%  (p=0.000 n=59+51)
Edited by Rasmus Munk Larsen