Speed up exp(x) by 30-35%

The implementations of pexp only ever pass a limited range of arguments to pldexp. By exploiting this, we can call a much faster version of pldexp in most cases where the result is not subnormal.
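
To illustrate the idea in scalar form, here is a minimal sketch of a restricted-range ldexp. It is not Eigen's actual packet code. The names `pow2_normal`, `ldexp_fast_sketch` and `ldexp_dispatch_sketch` are hypothetical, and the sketch assumes IEEE 754 binary64 doubles. When the exponent e is known to lie in [-1022, 1023], 2^e is a normal number whose bit pattern can be built with one add and one shift, so scaling costs a single multiply.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Eigen's API): builds the double 2^e
// directly from its bit pattern. Only valid for -1022 <= e <= 1023, i.e.
// when 2^e is a normal binary64 number.
inline double pow2_normal(int e) {
  // Biased exponent goes into bits 52..62; sign and mantissa bits stay zero.
  const std::uint64_t bits = static_cast<std::uint64_t>(e + 1023) << 52;
  double scale;
  std::memcpy(&scale, &bits, sizeof(scale));
  return scale;
}

// Fast path: one integer add, one shift, one floating-point multiply.
// The caller must guarantee -1022 <= e <= 1023.
inline double ldexp_fast_sketch(double a, int e) {
  return a * pow2_normal(e);
}

// Dispatch sketch: take the fast path when the exponent is in range,
// otherwise fall back to the fully general std::ldexp.
inline double ldexp_dispatch_sketch(double a, int e) {
  return (e >= -1022 && e <= 1023) ? ldexp_fast_sketch(a, e)
                                   : std::ldexp(a, e);
}
```

For example, `ldexp_fast_sketch(1.5, 10)` yields 1536.0. An exponent such as -1074 is outside the fast range, so the dispatch falls back to `std::ldexp`.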

This change also adjusts the clamping of arguments for pexp<double>(x) so that subnormal results are not needlessly flushed to zero. The relative accuracy degrades gracefully from ~10^-16 for normalized results to ~10^-12 for the smallest subnormal results, which is acceptable.

We measure a significant speedup, roughly 30-35% for double at sizes of 512 elements and above:

SSE 4.2:

name                       old cpu/op   new cpu/op   delta
BM_eigen_exp_double/1      3.54ns ± 0%  2.55ns ± 1%  -28.03%  (p=0.000 n=46+49)
BM_eigen_exp_double/8      55.6ns ± 1%  54.4ns ± 1%   -2.14%  (p=0.000 n=55+46)
BM_eigen_exp_double/64      256ns ± 8%   194ns ± 7%  -24.36%  (p=0.000 n=54+56)
BM_eigen_exp_double/512    1.82µs ± 7%  1.25µs ± 6%  -31.44%  (p=0.000 n=51+58)
BM_eigen_exp_double/4k     14.3µs ± 7%   9.7µs ± 9%  -31.96%  (p=0.000 n=57+60)
BM_eigen_exp_double/32k     113µs ± 5%    78µs ± 7%  -31.53%  (p=0.000 n=52+60)
BM_eigen_exp_double/256k    909µs ± 5%   620µs ± 8%  -31.86%  (p=0.000 n=53+60)
BM_eigen_exp_double/1M     3.73ms ± 9%  2.51ms ± 7%  -32.91%  (p=0.000 n=46+53)
BM_eigen_exp_float/1       2.46ns ± 1%  1.92ns ± 1%  -21.93%  (p=0.000 n=49+56)
BM_eigen_exp_float/8       37.3ns ±12%  36.6ns ± 1%   -2.03%  (p=0.000 n=40+43)
BM_eigen_exp_float/64       116ns ±10%    91ns ± 5%  -21.55%  (p=0.000 n=58+59)
BM_eigen_exp_float/512      701ns ± 4%   505ns ± 5%  -27.96%  (p=0.000 n=51+59)
BM_eigen_exp_float/4k      5.37µs ± 4%  3.80µs ± 4%  -29.24%  (p=0.000 n=51+50)
BM_eigen_exp_float/32k     43.1µs ± 7%  30.1µs ± 4%  -30.23%  (p=0.000 n=54+49)
BM_eigen_exp_float/256k     345µs ± 6%   242µs ± 5%  -29.86%  (p=0.000 n=47+54)
BM_eigen_exp_float/1M      1.38ms ± 7%  0.97ms ± 5%  -30.10%  (p=0.000 n=57+59)


AVX2+FMA:

name                       old cpu/op   new cpu/op   delta
BM_eigen_exp_double/1      3.00ns ± 1%  2.73ns ± 0%   -9.05%  (p=0.000 n=52+51)
BM_eigen_exp_double/8      53.3ns ± 3%  52.2ns ± 2%   -2.00%  (p=0.000 n=50+55)
BM_eigen_exp_double/64      215ns ± 5%   164ns ± 5%  -23.91%  (p=0.000 n=58+59)
BM_eigen_exp_double/512    1.48µs ± 9%  1.03µs ± 4%  -30.84%  (p=0.000 n=60+55)
BM_eigen_exp_double/4k     11.4µs ± 7%   7.9µs ± 3%  -30.95%  (p=0.000 n=59+57)
BM_eigen_exp_double/32k    91.5µs ± 7%  63.2µs ± 6%  -30.96%  (p=0.000 n=60+58)
BM_eigen_exp_double/256k    732µs ± 6%   507µs ± 6%  -30.73%  (p=0.000 n=59+58)
BM_eigen_exp_double/1M     2.98ms ± 6%  2.04ms ± 6%  -31.78%  (p=0.000 n=49+53)
BM_eigen_exp_float/1       2.18ns ± 0%  2.45ns ± 0%  +12.40%  (p=0.000 n=56+54)
BM_eigen_exp_float/8       29.2ns ± 1%  28.6ns ± 0%   -1.85%  (p=0.000 n=47+48)
BM_eigen_exp_float/64      84.9ns ± 3%  63.4ns ± 5%  -25.28%  (p=0.000 n=56+59)
BM_eigen_exp_float/512      523ns ± 3%   325ns ± 6%  -37.73%  (p=0.000 n=56+47)
BM_eigen_exp_float/4k      4.06µs ± 4%  2.43µs ± 4%  -40.19%  (p=0.000 n=48+50)
BM_eigen_exp_float/32k     32.1µs ± 5%  19.2µs ± 4%  -40.21%  (p=0.000 n=50+49)
BM_eigen_exp_float/256k     258µs ± 4%   155µs ± 7%  -39.73%  (p=0.000 n=51+59)
BM_eigen_exp_float/1M      1.03ms ± 4%  0.62ms ± 7%  -39.97%  (p=0.000 n=57+57)


AVX512:

name                       old cpu/op   new cpu/op   delta
BM_eigen_exp_double/1      3.26ns ± 4%  2.73ns ± 1%  -16.28%  (p=0.000 n=50+50)
BM_eigen_exp_double/8       108ns ± 5%   101ns ± 6%   -6.60%  (p=0.000 n=60+48)
BM_eigen_exp_double/64      199ns ± 6%   170ns ± 5%  -14.53%  (p=0.000 n=60+59)
BM_eigen_exp_double/512     857ns ± 7%   578ns ± 7%  -32.51%  (p=0.000 n=58+57)
BM_eigen_exp_double/4k     6.13µs ± 8%  3.89µs ± 9%  -36.46%  (p=0.000 n=56+50)
BM_eigen_exp_double/32k    48.4µs ± 9%  30.8µs ± 9%  -36.46%  (p=0.000 n=58+51)
BM_eigen_exp_double/256k    388µs ± 9%   254µs ± 6%  -34.45%  (p=0.000 n=49+50)
BM_eigen_exp_double/1M     1.58ms ±12%  1.03ms ± 9%  -34.52%  (p=0.000 n=57+59)
BM_eigen_exp_float/1       3.56ns ± 4%  3.42ns ± 5%   -3.89%  (p=0.000 n=49+50)
BM_eigen_exp_float/8       32.1ns ± 5%  32.2ns ± 5%     ~     (p=0.406 n=51+49)
BM_eigen_exp_float/64      90.2ns ± 5%  83.4ns ± 7%   -7.62%  (p=0.000 n=59+58)
BM_eigen_exp_float/512      279ns ± 4%   214ns ± 5%  -23.45%  (p=0.000 n=51+55)
BM_eigen_exp_float/4k      1.80µs ± 5%  1.31µs ± 5%  -27.37%  (p=0.000 n=56+57)
BM_eigen_exp_float/32k     13.9µs ± 4%  10.5µs ± 6%  -24.76%  (p=0.000 n=56+56)
BM_eigen_exp_float/256k     120µs ± 6%   101µs ± 5%  -16.20%  (p=0.000 n=58+56)
BM_eigen_exp_float/1M       482µs ± 5%   406µs ± 6%  -15.75%  (p=0.000 n=59+51)
Edited by Rasmus Munk Larsen