Speed up and improve accuracy of tanh.

This MR implements a new rational approximation of tanh for float. The approximation was computed with the awesome rminimax tool, using the command

ratapprox --function="tanh(x)" --dom='[-8.67,8.67]' --num="odd" --den="even" --type="[9,8]" --numF="[SG]" --denF="[SG]" --log --output=tanhf.sollya --dispCoeff="dec"
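
For readers unfamiliar with this kind of approximation, here is a minimal sketch of the shape of function the command above appears to request: an odd numerator of degree 9 divided by an even denominator of degree 8, evaluated on the clamped interval [-8.67, 8.67]. The name `tanh_pade_sketch` is hypothetical, and the coefficients are the classical [9/8] Padé approximant of tanh, swapped in only as a stand-in. They are not the minimax coefficients that rminimax produced for this MR, and the clamp is an assumption about how the fitted interval is used.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper illustrating the odd/even [9,8] rational form.
// Coefficients: classical [9/8] Pade approximant of tanh, used as a stand-in.
// They are NOT the minimax coefficients generated for this MR.
inline float tanh_pade_sketch(float x) {
  // Clamp to the fitting interval from --dom (assumed usage).
  x = std::min(std::max(x, -8.67f), 8.67f);
  const float x2 = x * x;

  // Numerator x * P(x^2): odd, degree 9, Horner evaluation in x^2.
  float p = x2 + 990.0f;
  p = p * x2 + 135135.0f;
  p = p * x2 + 4729725.0f;
  p = p * x2 + 34459425.0f;
  p *= x;

  // Denominator Q(x^2): even, degree 8, Horner evaluation in x^2.
  float q = 45.0f * x2 + 13860.0f;
  q = q * x2 + 945945.0f;
  q = q * x2 + 16216200.0f;
  q = q * x2 + 34459425.0f;

  return p / q;
}
```

Each Horner step has the form a * b + c, which a compiler may fuse into a single FMA instruction where the target supports it. The Padé stand-in is very accurate near zero but drifts to an absolute error on the order of 1e-4 near the ends of the interval, which is the kind of behaviour a minimax fit over the whole domain is meant to avoid.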

Measured over all floats, this reduces the maximum error from 2.9 to 2.5 ulps when FMA is available. Without FMA, the maximum error is unchanged at 3 ulps.
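
As an illustration of what "measured over all floats" can look like, the sketch below computes the error in ulps of a float result against a double-precision reference and scans a range of float bit patterns. It is not the harness used for the numbers above: the helper names `ulp_of`, `ulp_error` and `max_ulp_error` are hypothetical, and using `std::tanh` in double precision as the reference is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Spacing between |ref| and the next larger float, i.e. one ulp at ref.
inline double ulp_of(float ref) {
  const float a = std::fabs(ref);
  const float next = std::nextafter(a, std::numeric_limits<float>::infinity());
  return static_cast<double>(next) - static_cast<double>(a);
}

// Error of a float result, in ulps of the reference value rounded to float.
inline double ulp_error(float approx, double exact) {
  return std::fabs(static_cast<double>(approx) - exact) /
         ulp_of(static_cast<float>(exact));
}

// Scans every float whose bit pattern lies in [first_bits, last_bits] and
// returns the worst ulp error of f against double-precision std::tanh.
// Passing 0 and 0xFFFFFFFF covers all floats; non-finite inputs are skipped.
template <typename F>
double max_ulp_error(F f, std::uint32_t first_bits, std::uint32_t last_bits) {
  double worst = 0.0;
  for (std::uint64_t b = first_bits; b <= last_bits; ++b) {
    const std::uint32_t bits = static_cast<std::uint32_t>(b);
    float x;
    std::memcpy(&x, &bits, sizeof x);
    if (!std::isfinite(x)) continue;
    const double ref = std::tanh(static_cast<double>(x));
    worst = std::max(worst, ulp_error(f(x), ref));
  }
  return worst;
}
```

The loop counter is 64-bit so that the upper bound 0xFFFFFFFF does not wrap around. A full scan visits about 4.3 billion inputs, so in practice it would be run once per build configuration (with and without FMA).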

It speeds up tanh by roughly 12-27% for SSE 4.2 and 34-53% for AVX2+FMA, depending on input size:

AVX2+FMA (-march=haswell):

name                       old cpu/op   new cpu/op   delta
BM_eigen_tanh_float/1      3.40ns ± 0%  1.91ns ± 0%  -43.80%  (p=0.000 n=50+55)
BM_eigen_tanh_float/8      19.2ns ± 0%  12.6ns ± 3%  -34.07%  (p=0.000 n=58+50)
BM_eigen_tanh_float/64     54.6ns ± 7%  29.4ns ± 8%  -46.17%  (p=0.000 n=60+50)
BM_eigen_tanh_float/512     346ns ± 8%   166ns ± 4%  -51.97%  (p=0.000 n=49+59)
BM_eigen_tanh_float/4k     2.71µs ± 8%  1.27µs ± 3%  -53.29%  (p=0.000 n=55+58)
BM_eigen_tanh_float/32k    21.3µs ± 7%  10.1µs ± 5%  -52.74%  (p=0.000 n=52+59)
BM_eigen_tanh_float/256k    170µs ± 7%    95µs ± 5%  -43.89%  (p=0.000 n=49+60)
BM_eigen_tanh_float/1M      689µs ±11%   381µs ± 4%  -44.81%  (p=0.000 n=60+45)


SSE 4.2:

name                       old cpu/op   new cpu/op   delta
BM_eigen_tanh_float/1      2.31ns ± 0%  2.04ns ± 0%  -11.70%  (p=0.000 n=54+51)
BM_eigen_tanh_float/8      19.7ns ± 3%  14.4ns ± 0%  -27.02%  (p=0.000 n=47+49)
BM_eigen_tanh_float/64     89.7ns ± 4%  71.3ns ± 0%  -20.47%  (p=0.000 n=48+49)
BM_eigen_tanh_float/512     652ns ± 1%   530ns ± 0%  -18.66%  (p=0.000 n=45+46)
BM_eigen_tanh_float/4k     5.14µs ± 0%  4.19µs ± 1%  -18.50%  (p=0.000 n=37+56)
BM_eigen_tanh_float/32k    41.0µs ± 0%  33.6µs ± 2%  -18.01%  (p=0.000 n=39+49)
BM_eigen_tanh_float/256k    328µs ± 0%   271µs ± 1%  -17.27%  (p=0.000 n=32+48)
BM_eigen_tanh_float/1M     1.31ms ± 1%  1.09ms ± 0%  -17.25%  (p=0.000 n=45+50)
Edited by Rasmus Munk Larsen
