Speed up and improve accuracy of tanh.
This MR implements a new rational approximation of tanh for float. The approximation was computed with the awesome rminimax tool, using the following command:
ratapprox --function="tanh(x)" --dom='[-8.67,8.67]' --num="odd" --den="even" --type="[9,8]" --numF="[SG]" --denF="[SG]" --log --output=tanhf.sollya --dispCoeff="dec"
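To make the shape of such an approximant concrete, here is a minimal sketch of a [9,8] rational evaluation: an odd numerator x * P(x^2) over an even denominator Q(x^2), both in Horner form, applied to an input clamped to the fitting interval. The function name `tanh_rational_sketch` is hypothetical, and the coefficients are NOT the rminimax output (which is not listed in this description). They are the classical [9/8] Padé coefficients of tanh, swapped in only so the example runs. This stand-in is much less accurate than a minimax fit near the interval edge, where its error is on the order of 1e-4 and the result can slightly exceed 1 in magnitude.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper for illustration; not the code added by this MR.
inline float tanh_rational_sketch(float x) {
  // Clamp to the fitting interval used in the ratapprox command.
  // NaN handling is omitted in this sketch.
  const float kClamp = 8.67f;
  x = std::max(-kClamp, std::min(kClamp, x));
  const float x2 = x * x;

  // Odd numerator x * P(x^2) in Horner form.
  // Pade coefficients, assumed for illustration; NOT the rminimax output.
  float p = 1.0f;
  p = std::fma(p, x2, 990.0f);
  p = std::fma(p, x2, 135135.0f);
  p = std::fma(p, x2, 4729725.0f);
  p = std::fma(p, x2, 34459425.0f);
  p *= x;

  // Even denominator Q(x^2) in Horner form.
  float q = 45.0f;
  q = std::fma(q, x2, 13860.0f);
  q = std::fma(q, x2, 945945.0f);
  q = std::fma(q, x2, 16216200.0f);
  q = std::fma(q, x2, 34459425.0f);

  return p / q;
}
```

Because the numerator is odd and the denominator is even, the result is exactly antisymmetric in x, and each polynomial needs only four fused multiply-adds in x^2.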
Measured over all floats, this reduces the maximum error from 2.9 to 2.5 ulps when FMA is available. Without FMA, the maximum error is unchanged at 3 ulps.
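The description does not say how the error was measured. One common approach, sketched here under that assumption, is to walk the float bit patterns, evaluate a double-precision reference, and express the difference in units of the float spacing at the reference value. The names `ulp_error` and `max_ulp_error` are hypothetical, and using `std::tanh` in double as the reference is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// Error of `approx` against `reference`, in units of the float spacing
// (ULP) at the reference value rounded to float.
inline double ulp_error(float approx, double reference) {
  const float ref_f = static_cast<float>(reference);
  // Next float away from zero, to get the local spacing.
  const float next = std::nextafter(ref_f, std::copysign(INFINITY, ref_f));
  const double ulp =
      std::fabs(static_cast<double>(next) - static_cast<double>(ref_f));
  return std::fabs(static_cast<double>(approx) - reference) / ulp;
}

// Scans float bit patterns with the given stride and returns the worst
// error of f against double-precision std::tanh. A stride of 1 visits
// every float; larger strides sample the range.
template <typename F>
double max_ulp_error(F f, std::uint32_t stride) {
  double worst = 0.0;
  for (std::uint64_t bits = 0; bits <= 0xFFFFFFFFull; bits += stride) {
    const std::uint32_t b = static_cast<std::uint32_t>(bits);
    float x;
    std::memcpy(&x, &b, sizeof(x));  // reinterpret the bits as a float
    if (!std::isfinite(x)) continue;  // skip NaN and infinity patterns
    const double ref = std::tanh(static_cast<double>(x));
    worst = std::max(worst, ulp_error(f(x), ref));
  }
  return worst;
}
```

Calling `max_ulp_error(f, 1)` with the function under test covers every finite float. A larger stride gives a quick sampled estimate.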
It speeds up tanh by roughly 12-27% with SSE 4.2 and 34-53% with AVX2+FMA, depending on the benchmark size:
AVX2+FMA (-march=haswell):
name                      old cpu/op   new cpu/op   delta
BM_eigen_tanh_float/1     3.40ns ± 0%  1.91ns ± 0%  -43.80%  (p=0.000 n=50+55)
BM_eigen_tanh_float/8     19.2ns ± 0%  12.6ns ± 3%  -34.07%  (p=0.000 n=58+50)
BM_eigen_tanh_float/64    54.6ns ± 7%  29.4ns ± 8%  -46.17%  (p=0.000 n=60+50)
BM_eigen_tanh_float/512    346ns ± 8%   166ns ± 4%  -51.97%  (p=0.000 n=49+59)
BM_eigen_tanh_float/4k    2.71µs ± 8%  1.27µs ± 3%  -53.29%  (p=0.000 n=55+58)
BM_eigen_tanh_float/32k   21.3µs ± 7%  10.1µs ± 5%  -52.74%  (p=0.000 n=52+59)
BM_eigen_tanh_float/256k   170µs ± 7%    95µs ± 5%  -43.89%  (p=0.000 n=49+60)
BM_eigen_tanh_float/1M     689µs ±11%   381µs ± 4%  -44.81%  (p=0.000 n=60+45)
SSE 4.2:
name                      old cpu/op   new cpu/op   delta
BM_eigen_tanh_float/1     2.31ns ± 0%  2.04ns ± 0%  -11.70%  (p=0.000 n=54+51)
BM_eigen_tanh_float/8     19.7ns ± 3%  14.4ns ± 0%  -27.02%  (p=0.000 n=47+49)
BM_eigen_tanh_float/64    89.7ns ± 4%  71.3ns ± 0%  -20.47%  (p=0.000 n=48+49)
BM_eigen_tanh_float/512    652ns ± 1%   530ns ± 0%  -18.66%  (p=0.000 n=45+46)
BM_eigen_tanh_float/4k    5.14µs ± 0%  4.19µs ± 1%  -18.50%  (p=0.000 n=37+56)
BM_eigen_tanh_float/32k   41.0µs ± 0%  33.6µs ± 2%  -18.01%  (p=0.000 n=39+49)
BM_eigen_tanh_float/256k   328µs ± 0%   271µs ± 1%  -17.27%  (p=0.000 n=32+48)
BM_eigen_tanh_float/1M    1.31ms ± 1%  1.09ms ± 0%  -17.25%  (p=0.000 n=45+50)
Edited by Rasmus Munk Larsen