Add a ptanh_float implementation that is accurate to 1 ULP
This introduces a new implementation of tanh for float that is accurate to 1 ULP, whereas the previous implementation was off by up to 6 ULP. This has some consequences for performance (measured on an Intel(R) Xeon(R) Gold 6154 CPU):
- The speed increases by ~25% if all elements in a packet have a magnitude less than 1.25.
- The speed decreases by ~2-2.5x if at least one element in a packet has a magnitude of 1.25 or more.
For now, the new slower but more accurate implementation is used unless EIGEN_FAST_MATH is set. We probably want to expose the two implementations as separate APIs in the future.
Measurements for |x| < 1.25:
name                      cpu/op (base)   cpu/op (new)    vs base
BM_eigen_tanh_float/1     2.698n ± 1%     3.001n ± 1%     +11.23%  (p=0.000 n=66+65)
BM_eigen_tanh_float/8     2.598n ± 1%     3.273n ± 0%     +25.98%  (p=0.000 n=66+60)
BM_eigen_tanh_float/64    19.81n ± 1%     15.66n ± 1%     -20.95%  (n=72)
BM_eigen_tanh_float/512   160.9n ± 0%     115.9n ± 1%     -27.97%  (n=72)
BM_eigen_tanh_float/4k    1260.2n ± 0%    923.0n ± 1%     -26.75%  (n=72)
BM_eigen_tanh_float/32k   10.373µ ± 1%    7.301µ ± 1%     -29.61%  (n=72)
BM_eigen_tanh_float/256k  96.63µ ± 0%     82.85µ ± 0%     -14.26%  (n=72)
BM_eigen_tanh_float/1M    386.1µ ± 1%     331.2µ ± 0%     -14.21%  (n=54+60)
geomean                   568.4n          489.5n          -13.88%
Measurements for |x| >= 1.25:
name                      cpu/op (base)   cpu/op (new)    vs base
BM_eigen_tanh_float/1     2.761n ± 1%     9.160n ± 1%     +231.76%  (p=0.000 n=66+72)
BM_eigen_tanh_float/8     2.647n ± 0%     8.928n ± 1%     +237.31%  (p=0.000 n=66+72)
BM_eigen_tanh_float/64    20.20n ± 0%     69.33n ± 1%     +243.18%  (p=0.000 n=72)
BM_eigen_tanh_float/512   164.2n ± 1%     551.4n ± 1%     +235.79%  (p=0.000 n=72)
BM_eigen_tanh_float/4k    1.289µ ± 0%     4.448µ ± 0%     +245.19%  (p=0.000 n=72)
BM_eigen_tanh_float/32k   10.59µ ± 0%     35.51µ ± 1%     +235.40%  (p=0.000 n=72+60)
BM_eigen_tanh_float/256k  98.68µ ± 1%     283.56µ ± 1%    +187.35%  (p=0.000 n=72+64)
BM_eigen_tanh_float/1M    393.8µ ± 1%     1130.6µ ± 1%    +187.10%  (p=0.000 n=54+72)
geomean                   580.3n          1.883µ          +224.57%
Edited by Rasmus Munk Larsen