Add a ptanh_float implementation that is accurate to 1 ULP

This introduces a new implementation of tanh for float that is accurate to 1 ULP; the previous implementation was off by up to 6 ULP. The change affects performance (measured on an Intel(R) Xeon(R) Gold 6154 CPU):
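
As a rough illustration of how an error bound such as "1 ULP" can be checked, the sketch below counts the number of representable floats between a computed result and a reference value obtained by evaluating `std::tanh` in double precision and rounding to float. The helper names (`ordered_bits`, `ulp_distance`, `max_ulp_error`) are hypothetical and are not part of Eigen; the measurement used for this change may differ.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a float's bit pattern to an integer that increases monotonically
// with the float's value (so +0.0f and -0.0f both map to 0).
inline std::int64_t ordered_bits(float x) {
  std::int32_t i;
  std::memcpy(&i, &x, sizeof i);
  if (i < 0) {
    return static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()) - i;
  }
  return i;
}

// Number of representable floats between a and b (0 if they are equal).
// Only meaningful for finite, non-NaN inputs.
inline std::int64_t ulp_distance(float a, float b) {
  const std::int64_t d = ordered_bits(a) - ordered_bits(b);
  return d < 0 ? -d : d;
}

// Largest distance between f(x) and the double-precision reference,
// sampled at n evenly spaced points in [lo, hi].
template <typename F>
std::int64_t max_ulp_error(F f, float lo, float hi, int n) {
  std::int64_t worst = 0;
  for (int k = 0; k < n; ++k) {
    const float x = lo + (hi - lo) * static_cast<float>(k) / static_cast<float>(n - 1);
    const float ref = static_cast<float>(std::tanh(static_cast<double>(x)));
    const std::int64_t d = ulp_distance(f(x), ref);
    if (d > worst) worst = d;
  }
  return worst;
}
```

Calling `max_ulp_error` with the implementation under test and a dense sample of the input range gives an empirical estimate of the worst-case error; it does not prove a bound for inputs that were not sampled.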

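  1. If all elements in a packet have a magnitude less than 1.25, CPU time drops by roughly 14-30% for arrays of 64 or more elements, but rises by roughly 11-26% for arrays of 1 and 8 elements.
  2. If at least one element in a packet has a magnitude of 1.25 or more, CPU time rises by roughly 190-245% (about 2.9-3.5x).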
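
The per-packet behavior described in the two points above can be sketched as follows. This is a scalar emulation under stated assumptions, not Eigen's actual kernel: the packet width of 8, the names `all_small` and `tanh_packet`, and the use of `std::tanh` as a stand-in for both code paths are all hypothetical. Only the 1.25 threshold comes from the description.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kPacketSize = 8;  // hypothetical packet width
constexpr float kThreshold = 1.25f;     // threshold quoted in the description

// True only if every lane has magnitude below the threshold.
// A NaN lane fails the comparison and therefore selects the general path.
inline bool all_small(const std::array<float, kPacketSize>& p) {
  for (float x : p) {
    if (!(std::fabs(x) < kThreshold)) return false;
  }
  return true;
}

enum class Path { kSmallOnly, kGeneral };

// Chooses one path for the whole packet: a single large lane is enough
// to send all lanes down the more expensive general path.
inline Path tanh_packet(const std::array<float, kPacketSize>& in,
                        std::array<float, kPacketSize>& out) {
  const Path path = all_small(in) ? Path::kSmallOnly : Path::kGeneral;
  for (std::size_t i = 0; i < kPacketSize; ++i) {
    out[i] = std::tanh(in[i]);  // stand-in for either vectorized kernel
  }
  return path;
}
```

Because the choice is made once per packet rather than per element, the cost of a packet depends on its largest element, which is consistent with the two benchmark groups being split by input magnitude.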

For now, the new slower but more accurate implementation is used unless EIGEN_FAST_MATH is set. We probably want to provide the two separate implementations as separate APIs in the future.
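
A minimal sketch of this kind of compile-time selection is shown below. `EIGEN_FAST_MATH` is the macro named above; the function names and the fallback definition are hypothetical and exist only to keep the sketch self-contained (Eigen sets its own default for the macro in its headers). Both stand-in functions simply call `std::tanh`.

```cpp
#include <cmath>

// Fallback for this sketch only; Eigen's headers define their own default.
#ifndef EIGEN_FAST_MATH
#define EIGEN_FAST_MATH 0
#endif

constexpr bool kUseFastTanh = (EIGEN_FAST_MATH != 0);

// Hypothetical stand-ins for the two implementations.
inline float tanh_fast_sketch(float x) { return std::tanh(x); }
inline float tanh_accurate_sketch(float x) { return std::tanh(x); }

// The preprocessor picks exactly one implementation at compile time,
// so there is no runtime cost for the choice itself.
inline float tanh_dispatch(float x) {
#if EIGEN_FAST_MATH
  return tanh_fast_sketch(x);
#else
  return tanh_accurate_sketch(x);
#endif
}
```

Exposing the two implementations as separate functions, as suggested above for the future, would let callers choose per call site instead of per translation unit.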

Measurements for |x| < 1.25:

name                       base cpu/op   new cpu/op   vs base
BM_eigen_tanh_float/1       2.698n ± 1%   3.001n ± 1%  +11.23% (p=0.000 n=66+65)
BM_eigen_tanh_float/8       2.598n ± 1%   3.273n ± 0%  +25.98% (p=0.000 n=66+60)
BM_eigen_tanh_float/64      19.81n ± 1%   15.66n ± 1%  -20.95% (n=72)
BM_eigen_tanh_float/512     160.9n ± 0%   115.9n ± 1%  -27.97% (n=72)
BM_eigen_tanh_float/4k     1260.2n ± 0%   923.0n ± 1%  -26.75% (n=72)
BM_eigen_tanh_float/32k    10.373µ ± 1%   7.301µ ± 1%  -29.61% (n=72)
BM_eigen_tanh_float/256k    96.63µ ± 0%   82.85µ ± 0%  -14.26% (n=72)
BM_eigen_tanh_float/1M      386.1µ ± 1%   331.2µ ± 0%  -14.21% (n=54+60)
geomean                    568.4n        489.5n       -13.88%

Measurements for |x| >= 1.25:

name                       base cpu/op    new cpu/op   vs base
BM_eigen_tanh_float/1      2.761n ± 1%    9.160n ± 1%  +231.76% (p=0.000 n=66+72)
BM_eigen_tanh_float/8      2.647n ± 0%    8.928n ± 1%  +237.31% (p=0.000 n=66+72)
BM_eigen_tanh_float/64     20.20n ± 0%    69.33n ± 1%  +243.18% (p=0.000 n=72)
BM_eigen_tanh_float/512    164.2n ± 1%    551.4n ± 1%  +235.79% (p=0.000 n=72)
BM_eigen_tanh_float/4k     1.289µ ± 0%    4.448µ ± 0%  +245.19% (p=0.000 n=72)
BM_eigen_tanh_float/32k    10.59µ ± 0%    35.51µ ± 1%  +235.40% (p=0.000 n=72+60)
BM_eigen_tanh_float/256k   98.68µ ± 1%   283.56µ ± 1%  +187.35% (p=0.000 n=72+64)
BM_eigen_tanh_float/1M     393.8µ ± 1%   1130.6µ ± 1%  +187.10% (p=0.000 n=54+72)
geomean                    580.3n         1.883µ       +224.57%
Edited by Rasmus Munk Larsen
