Vectorize tan(x)
The vectorized implementation has a maximum error of 4 ULP when built with AVX2+FMA.
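For readers who want to reproduce an error bound like the one above, the following is a minimal sketch of one common way to measure ULP error for a float tan(x). It is not the test used for this change. The helper names (OrderedBits, UlpDistance, MaxUlpError), the sampling scheme, and the choice of double-precision std::tan as the reference are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: map a float's bit pattern to an integer that increases
// monotonically with the float's value, so adjacent floats differ by 1.
inline std::int64_t OrderedBits(float x) {
  std::int32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  // Negative floats are sign-magnitude; mirror them below zero.
  if (bits < 0) {
    return static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()) - bits;
  }
  return bits;
}

// Hypothetical helper: distance between two finite floats in units in the last place.
inline std::int64_t UlpDistance(float a, float b) {
  const std::int64_t d = OrderedBits(a) - OrderedBits(b);
  return d < 0 ? -d : d;
}

// Hypothetical helper: worst ULP error of `tan_under_test` over `samples`
// evenly spaced points in [lo, hi]. Assumes samples >= 2.
// The reference is std::tan evaluated in double and rounded to float (an assumption).
template <typename F>
std::int64_t MaxUlpError(F tan_under_test, float lo, float hi, int samples) {
  std::int64_t worst = 0;
  for (int i = 0; i < samples; ++i) {
    const float t = static_cast<float>(i) / static_cast<float>(samples - 1);
    const float x = lo + (hi - lo) * t;
    const float expected = static_cast<float>(std::tan(static_cast<double>(x)));
    const float actual = tan_under_test(x);
    const std::int64_t err = UlpDistance(actual, expected);
    if (err > worst) worst = err;
  }
  return worst;
}
```

Passing the function under test as a callable, for example a lambda that wraps a vectorized tan applied to one element, keeps the sketch independent of any particular library. Evenly spaced samples are only a rough probe; an exhaustive sweep over all float inputs in the range of interest gives a tighter bound.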
Benchmark measurements:
name                       base cpu/op     new cpu/op      vs base
BM_eigen_tan_float/1       6.759n ± 0%     10.999n ± 1%    +62.73% (p=0.000 n=72)
BM_eigen_tan_float/8       44.14n ± 0%     10.67n ± 1%     -75.84% (n=72)
BM_eigen_tan_float/64      350.33n ± 0%    59.72n ± 2%     -82.95% (n=60+72)
BM_eigen_tan_float/512     2761.0n ± 0%    436.4n ± 1%     -84.20% (n=66+72)
BM_eigen_tan_float/4k      22.136µ ± 0%    3.472µ ± 1%     -84.32% (n=71+60)
BM_eigen_tan_float/32k     176.69µ ± 0%    27.56µ ± 1%     -84.41% (n=72+65)
BM_eigen_tan_float/256k    1413.5µ ± 0%    221.5µ ± 2%     -84.33% (n=72+70)
BM_eigen_tan_float/1M      5653.5µ ± 0%    877.6µ ± 2%     -84.48% (n=72)
geomean                    7.403µ          1.657µ          -77.62%

name                       base cpu/op     new cpu/op      vs base
BM_eigen_tan_double/1      18.18n ± 0%     19.84n ± 0%     +9.11% (p=0.000 n=72)
BM_eigen_tan_double/8      137.76n ± 0%    36.19n ± 1%     -73.73% (n=72+59)
BM_eigen_tan_double/64     1100.6n ± 0%    262.1n ± 1%     -76.19% (n=72+66)
BM_eigen_tan_double/512    8.769µ ± 0%     2.039µ ± 1%     -76.74% (n=72)
BM_eigen_tan_double/4k     70.10µ ± 0%     16.39µ ± 1%     -76.62% (n=72)
BM_eigen_tan_double/32k    560.8µ ± 0%     130.8µ ± 1%     -76.67% (n=72)
BM_eigen_tan_double/256k   4.487m ± 0%     1.045m ± 1%     -76.70% (n=72)
BM_eigen_tan_double/1M     17.970m ± 0%    4.212m ± 1%     -76.56% (n=70+69)
geomean                    22.94µ          6.605µ          -71.21%
Edited by Rasmus Munk Larsen