Add vectorized implementation of tanh<double>

Add vectorized implementation of tanh<double>.

Benchmark measurements show the following speedups:

| ISA | Speedup |
|-----|---------|
| SSE 4.2 | 4x |
| AVX2+FMA | 13.3x |
| AVX512 | 22x |

Full benchmark results here.

Maximum difference from std::tanh<double> is 4 ULPs.
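
For readers unfamiliar with ULP-based error bounds, here is a minimal sketch of how such a figure could be measured. It is not taken from this merge request and is not part of Eigen's API: the helper names `ordered_bits`, `ulp_distance`, and `max_ulp_error` are hypothetical, and the sketch assumes IEEE 754 binary64 doubles and finite, non-NaN inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: map a double's bit pattern to a signed integer whose
// ordering matches the ordering of the doubles themselves.
inline std::int64_t ordered_bits(double x) {
  std::int64_t i;
  std::memcpy(&i, &x, sizeof(x));
  // Negative doubles are stored sign-magnitude; reflect them so that
  // -0.0 maps to 0 and more negative values map to more negative integers.
  return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Hypothetical helper: number of representable doubles between a and b.
// NaN inputs are not handled.
inline std::uint64_t ulp_distance(double a, double b) {
  const std::uint64_t ua = static_cast<std::uint64_t>(ordered_bits(a));
  const std::uint64_t ub = static_cast<std::uint64_t>(ordered_bits(b));
  // Unsigned subtraction wraps, so the difference is exact either way.
  return ordered_bits(a) >= ordered_bits(b) ? ua - ub : ub - ua;
}

// Hypothetical helper: largest ULP distance between candidate(x) and
// std::tanh(x) over an evenly spaced grid on [lo, hi].
template <typename F>
std::uint64_t max_ulp_error(F candidate, double lo, double hi, int steps) {
  assert(steps > 0);
  std::uint64_t worst = 0;
  for (int k = 0; k <= steps; ++k) {
    const double x = lo + (hi - lo) * k / steps;
    worst = std::max(worst, ulp_distance(candidate(x), std::tanh(x)));
  }
  return worst;
}
```

Passing a scalar wrapper around the vectorized kernel as `candidate` would report the worst-case ULP error on the sampled grid. An evenly spaced grid is only a rough check; a thorough test would also sample densely near zero and near the saturation region.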

Edited by Rasmus Munk Larsen
