Improve plog: 20% speedup for float + handle denormals (!799) · Merge requests · libeigen / eigen

For float, replace the degree-10 polynomial approximation of log(1+x) on [sqrt(0.5)-1;sqrt(2)-1] with a (3,3) rational approximation. This speeds up the function by ~20% for AVX2. The max relative error increases slightly, from 2 ulp to 2.2 ulp, for arguments > 1e-15. For tiny arguments, the error in both the old and new implementations rises to 64 ulp as x approaches std::numeric_limits<float>::denorm_min(). This is likely related to the range reduction and remains to be investigated.
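
To illustrate the shape of a (3,3) rational approximation, here is a minimal scalar C++ sketch that uses the textbook Padé approximant of log(1+x) about x = 0. This is an assumption made purely for illustration: the coefficients below are not the ones derived for this change, and a Padé approximant is less accurate over the interval than a fit tuned for it. The names `log1p_pade33` and `max_abs_error` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Textbook (3,3) Pade approximant of log(1+x) about x = 0:
//   log(1+x) ~= x*(60 + 60x + 11x^2) / (60 + 90x + 36x^2 + 3x^3)
// NOTE: illustrative coefficients only, NOT those used in this merge request.
inline double log1p_pade33(double x) {
  // Numerator and denominator are two independent cubic Horner chains.
  const double p = x * (60.0 + x * (60.0 + x * 11.0));
  const double q = 60.0 + x * (90.0 + x * (36.0 + x * 3.0));
  return p / q;
}

// Largest absolute difference from std::log1p over n+1 evenly spaced
// samples of [lo, hi] (hypothetical helper for a quick sanity check).
inline double max_abs_error(double lo, double hi, int n) {
  assert(n > 0);
  double worst = 0.0;
  for (int i = 0; i <= n; ++i) {
    const double x = lo + (hi - lo) * i / n;
    const double err = std::fabs(log1p_pade33(x) - std::log1p(x));
    if (err > worst) worst = err;
  }
  return worst;
}
```

Calling `max_abs_error(std::sqrt(0.5) - 1.0, std::sqrt(2.0) - 1.0, 1000)` gives a rough idea of how this particular approximant behaves on the reduced interval. The numerator and denominator are short, independent Horner chains followed by one division, which may pipeline better than a single long degree-10 chain; whether that accounts for the measured speedup is not established by this sketch.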
Change argument clamping so that log(x) no longer incorrectly saturates at ~-88 for denormalized float arguments, but continues down to ~-104 for positive denormal arguments. A similar fix is applied for double.
Re-enable a test for computing log(denorm_min).
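
The saturation issue can be sketched in scalar C++ as follows. This assumes the saturation arises from clamping the argument to the smallest normal float before range reduction, which is an assumption for illustration and not a description of the actual packet code. The helpers `log_via_frexp` and `log_clamped_to_min_normal` are hypothetical, and `std::log1p` stands in for the core approximation. For reference, ln(2^-126) ≈ -87.34 and ln(2^-149) ≈ -103.28, consistent with the rough figures quoted above.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper: log(x) for positive finite x, using the range
// reduction x = m * 2^e with m in [sqrt(0.5), sqrt(2)), so that
// log(x) = e*ln(2) + log1p(m - 1).
inline float log_via_frexp(float x) {
  int e = 0;
  float m = std::frexp(x, &e);  // m in [0.5, 1); also correct for denormals
  if (m < 0.70710678f) {        // move m into [sqrt(0.5), sqrt(2))
    m *= 2.0f;
    --e;
  }
  const double ln2 = 0.6931471805599453;
  return static_cast<float>(e * ln2 + std::log1p(static_cast<double>(m) - 1.0));
}

// Illustration of the assumed failure mode: clamping to the smallest
// normal float makes every denormal input return log(2^-126).
inline float log_clamped_to_min_normal(float x) {
  return log_via_frexp(std::max(x, std::numeric_limits<float>::min()));
}
```

With `x = std::numeric_limits<float>::denorm_min()`, the clamped variant returns roughly -87.34 while the unclamped one returns roughly -103.28, provided the build does not flush denormals to zero.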

Thanks to my colleague James Lottes for suggesting this change and deriving the (3,3) approximant.

Benchmark numbers for AVX2:

name                      old cpu/op  new cpu/op  delta
BM_eigen_log_float/1      3.55ns ± 0%  3.27ns ± 0%   -7.78%  (p=0.000 n=48+49)
BM_eigen_log_float/8      34.4ns ± 5%  32.7ns ± 0%   -4.97%  (p=0.000 n=50+38)
BM_eigen_log_float/64      107ns ± 5%    86ns ± 3%  -19.69%  (p=0.000 n=60+60)
BM_eigen_log_float/512     640ns ± 5%   502ns ± 5%  -21.56%  (p=0.000 n=60+60)
BM_eigen_log_float/4k     4.94µs ± 5%  3.84µs ± 3%  -22.22%  (p=0.000 n=60+51)
BM_eigen_log_float/32k    39.1µs ± 4%  30.5µs ± 3%  -22.07%  (p=0.000 n=46+50)
BM_eigen_log_float/256k    313µs ± 4%   244µs ± 4%  -21.93%  (p=0.000 n=45+50)
BM_eigen_log_float/1M     1.26ms ± 4%  0.97ms ± 2%  -23.06%  (p=0.000 n=39+30)

name                      old time/op             new time/op             delta
BM_eigen_log_float/1      3.55ns ± 0%             3.27ns ± 0%   -7.79%        (p=0.000 n=41+49)
BM_eigen_log_float/8      34.4ns ± 5%             32.7ns ± 0%   -4.98%        (p=0.000 n=50+38)
BM_eigen_log_float/64      107ns ± 5%               86ns ± 3%  -19.68%        (p=0.000 n=60+60)
BM_eigen_log_float/512     640ns ± 5%              502ns ± 5%  -21.56%        (p=0.000 n=60+60)
BM_eigen_log_float/4k     4.93µs ± 5%             3.84µs ± 3%  -22.19%        (p=0.000 n=60+52)
BM_eigen_log_float/32k    39.1µs ± 4%             30.5µs ± 3%  -22.06%        (p=0.000 n=46+50)
BM_eigen_log_float/256k    313µs ± 4%              244µs ± 4%  -21.94%        (p=0.000 n=45+50)
BM_eigen_log_float/1M     1.26ms ± 4%             0.97ms ± 2%  -23.07%        (p=0.000 n=39+30)

name                      old INSTRUCTIONS/op     new INSTRUCTIONS/op     delta
BM_eigen_log_float/1        41.0 ± 0%               41.0 ± 0%     ~     (all samples are equal)
BM_eigen_log_float/8         328 ± 0%                329 ± 0%   +0.30%        (p=0.000 n=48+48)
BM_eigen_log_float/64        778 ± 0%                684 ± 0%  -12.08%        (p=0.000 n=56+60)
BM_eigen_log_float/512     4.03k ± 0%              3.26k ± 0%  -19.03%        (p=0.000 n=53+56)
BM_eigen_log_float/4k      30.0k ± 0%              23.9k ± 0%  -20.47%        (p=0.000 n=56+46)
BM_eigen_log_float/32k      238k ± 0%               189k ± 0%  -20.66%        (p=0.000 n=37+44)
BM_eigen_log_float/256k    1.90M ± 0%              1.51M ± 0%  -20.69%        (p=0.000 n=38+45)
BM_eigen_log_float/1M      7.60M ± 0%              6.03M ± 0%  -20.69%        (p=0.000 n=36+35)

name                      old CYCLES/op           new CYCLES/op           delta
BM_eigen_log_float/1        13.1 ± 0%               12.1 ± 0%   -7.81%        (p=0.000 n=40+50)
BM_eigen_log_float/8         127 ± 5%                121 ± 0%   -4.98%        (p=0.000 n=50+37)
BM_eigen_log_float/64        362 ± 2%                293 ± 0%  -18.99%        (p=0.000 n=56+60)
BM_eigen_log_float/512     2.17k ± 2%              1.71k ± 1%  -21.00%        (p=0.000 n=60+60)
BM_eigen_log_float/4k      16.7k ± 2%              13.1k ± 1%  -21.65%        (p=0.000 n=59+52)
BM_eigen_log_float/32k      133k ± 3%               104k ± 1%  -21.58%        (p=0.000 n=46+45)
BM_eigen_log_float/256k    1.06M ± 2%              0.83M ± 1%  -21.41%        (p=0.000 n=45+50)
BM_eigen_log_float/1M      4.26M ± 3%              3.33M ± 1%  -21.77%        (p=0.000 n=39+38)

Edited Jan 05, 2022 by Rasmus Munk Larsen
