Improve plog: 20% speedup for float + handle denormals
This replaces !784 (closed)
- For `float`, replace the degree 10 polynomial approximation of `log(1+x)` on `[sqrt(0.5)-1;sqrt(2)-1]` by a (3,3) rational approximation. This speeds up the function by ~20% for AVX2. The max relative error increases slightly from 2 ulp to 2.2 ulp for arguments > 1e-15. For tiny arguments the error in both the old and new implementation rises to 64 ulp as x approaches `std::numeric_limits<float>::denorm_min()`. This is likely related to the range reduction and remains to be investigated.
- Change argument clamping such that `log(x)` does not incorrectly saturate at `~-88` for denormalized `float` arguments, but continues down to `~-104` for positive denormal arguments. A similar fix is done for `double`.
- Re-enable a test for computing `log(denorm_min)`.
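
To illustrate the structure described in the list above, here is a minimal scalar sketch of range reduction followed by a (3,3) rational approximation of `log(1+x)`. It is not the code in this merge request: the coefficients are those of the textbook Padé approximant about `x = 0`, used as a stand-in for the minimax approximant derived for this change, the arithmetic is done in `double` for clarity, and the names `log1p_pade33` and `log_sketch` are hypothetical. It handles positive finite inputs only.

```cpp
#include <cmath>
#include <limits>

// (3,3) Pade approximant of log(1+x) about x = 0, evaluated with Horner's
// scheme. Assumption: this is a textbook stand-in, NOT the minimax
// approximant used in the merge request, so it is less accurate.
inline double log1p_pade33(double x) {
  const double num = x * (60.0 + x * (60.0 + 11.0 * x));
  const double den = 60.0 + x * (90.0 + x * (36.0 + 3.0 * x));
  return num / den;
}

// Scalar log sketch for positive finite x (no handling of 0, negative
// values, infinity or NaN). Writes x = m * 2^e with m in
// [sqrt(0.5), sqrt(2)), then uses log(x) = e*ln(2) + log(1 + (m - 1)).
inline double log_sketch(float x) {
  int e = 0;
  // Widening to double makes float denormals normal, so frexp returns a
  // mantissa in [0.5, 1) for every positive float.
  double m = std::frexp(static_cast<double>(x), &e);
  if (m < 0.70710678118654752) {  // move m from [0.5, sqrt(0.5)) to [1, sqrt(2))
    m *= 2.0;
    --e;
  }
  return e * 0.69314718055994531 + log1p_pade33(m - 1.0);
}
```

With this sketch, `log_sketch(std::numeric_limits<float>::min())` is about -87.34 (that is, -126·ln 2) and `log_sketch(std::numeric_limits<float>::denorm_min())` is about -103.28 (-149·ln 2), which is roughly where the `~-88` and `~-104` figures above come from. Clamping the argument to the smallest normal `float` before the range reduction would pin every denormal input to the first value.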
Thanks to my colleague James Lottes for suggesting this change and deriving the (3,3) approximant.
Benchmark numbers for AVX2:
name old cpu/op new cpu/op delta
BM_eigen_log_float/1 3.55ns ± 0% 3.27ns ± 0% -7.78% (p=0.000 n=48+49)
BM_eigen_log_float/8 34.4ns ± 5% 32.7ns ± 0% -4.97% (p=0.000 n=50+38)
BM_eigen_log_float/64 107ns ± 5% 86ns ± 3% -19.69% (p=0.000 n=60+60)
BM_eigen_log_float/512 640ns ± 5% 502ns ± 5% -21.56% (p=0.000 n=60+60)
BM_eigen_log_float/4k 4.94µs ± 5% 3.84µs ± 3% -22.22% (p=0.000 n=60+51)
BM_eigen_log_float/32k 39.1µs ± 4% 30.5µs ± 3% -22.07% (p=0.000 n=46+50)
BM_eigen_log_float/256k 313µs ± 4% 244µs ± 4% -21.93% (p=0.000 n=45+50)
BM_eigen_log_float/1M 1.26ms ± 4% 0.97ms ± 2% -23.06% (p=0.000 n=39+30)
name old time/op new time/op delta
BM_eigen_log_float/1 3.55ns ± 0% 3.27ns ± 0% -7.79% (p=0.000 n=41+49)
BM_eigen_log_float/8 34.4ns ± 5% 32.7ns ± 0% -4.98% (p=0.000 n=50+38)
BM_eigen_log_float/64 107ns ± 5% 86ns ± 3% -19.68% (p=0.000 n=60+60)
BM_eigen_log_float/512 640ns ± 5% 502ns ± 5% -21.56% (p=0.000 n=60+60)
BM_eigen_log_float/4k 4.93µs ± 5% 3.84µs ± 3% -22.19% (p=0.000 n=60+52)
BM_eigen_log_float/32k 39.1µs ± 4% 30.5µs ± 3% -22.06% (p=0.000 n=46+50)
BM_eigen_log_float/256k 313µs ± 4% 244µs ± 4% -21.94% (p=0.000 n=45+50)
BM_eigen_log_float/1M 1.26ms ± 4% 0.97ms ± 2% -23.07% (p=0.000 n=39+30)
name old INSTRUCTIONS/op new INSTRUCTIONS/op delta
BM_eigen_log_float/1 41.0 ± 0% 41.0 ± 0% ~ (all samples are equal)
BM_eigen_log_float/8 328 ± 0% 329 ± 0% +0.30% (p=0.000 n=48+48)
BM_eigen_log_float/64 778 ± 0% 684 ± 0% -12.08% (p=0.000 n=56+60)
BM_eigen_log_float/512 4.03k ± 0% 3.26k ± 0% -19.03% (p=0.000 n=53+56)
BM_eigen_log_float/4k 30.0k ± 0% 23.9k ± 0% -20.47% (p=0.000 n=56+46)
BM_eigen_log_float/32k 238k ± 0% 189k ± 0% -20.66% (p=0.000 n=37+44)
BM_eigen_log_float/256k 1.90M ± 0% 1.51M ± 0% -20.69% (p=0.000 n=38+45)
BM_eigen_log_float/1M 7.60M ± 0% 6.03M ± 0% -20.69% (p=0.000 n=36+35)
name old CYCLES/op new CYCLES/op delta
BM_eigen_log_float/1 13.1 ± 0% 12.1 ± 0% -7.81% (p=0.000 n=40+50)
BM_eigen_log_float/8 127 ± 5% 121 ± 0% -4.98% (p=0.000 n=50+37)
BM_eigen_log_float/64 362 ± 2% 293 ± 0% -18.99% (p=0.000 n=56+60)
BM_eigen_log_float/512 2.17k ± 2% 1.71k ± 1% -21.00% (p=0.000 n=60+60)
BM_eigen_log_float/4k 16.7k ± 2% 13.1k ± 1% -21.65% (p=0.000 n=59+52)
BM_eigen_log_float/32k 133k ± 3% 104k ± 1% -21.58% (p=0.000 n=46+45)
BM_eigen_log_float/256k 1.06M ± 2% 0.83M ± 1% -21.41% (p=0.000 n=45+50)
BM_eigen_log_float/1M 4.26M ± 3% 3.33M ± 1% -21.77% (p=0.000 n=39+38)
Edited by Rasmus Munk Larsen