Add vectorized implementation of tanh<double>
Benchmark measurements show the following speedups:
| ISA | Speedup |
|---|---|
| SSE 4.2 | 4x |
| AVX2+FMA | 13.3x |
| AVX512 | 22x |
Full benchmark results here.
Maximum difference from std::tanh<double> is 4 ULPs.
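For readers who want to check an ULP bound like this themselves, here is a minimal sketch of one common way to count the ULP distance between two doubles. It is not taken from this change: the helper names `ordered_bits` and `ulp_distance` are hypothetical, and the sketch assumes IEEE 754 binary64 `double` and non-NaN inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not part of this change): reinterpret a double's bits
// as a signed integer that increases monotonically with the value.
// Assumes IEEE 754 binary64 doubles.
inline std::int64_t ordered_bits(double x) {
  std::int64_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  // Doubles are sign-magnitude, so negative values are remapped to negative
  // integers of the same magnitude; both +0.0 and -0.0 map to 0.
  return bits < 0 ? std::numeric_limits<std::int64_t>::min() - bits : bits;
}

// Hypothetical helper: number of representable doubles between a and b.
inline std::uint64_t ulp_distance(double a, double b) {
  assert(!std::isnan(a) && !std::isnan(b));  // NaN has no meaningful ULP distance
  const std::int64_t ia = ordered_bits(a);
  const std::int64_t ib = ordered_bits(b);
  // Subtract in unsigned arithmetic so widely separated inputs cannot overflow.
  return ia >= ib
             ? static_cast<std::uint64_t>(ia) - static_cast<std::uint64_t>(ib)
             : static_cast<std::uint64_t>(ib) - static_cast<std::uint64_t>(ia);
}
```

A check along these lines would evaluate `ulp_distance(std::tanh(x), y)` over a range of inputs `x`, where `y` is the vectorized result for the same input, and take the maximum.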
Edited by Rasmus Munk Larsen