Vectorize erf(x) for double.
Measured speedup:
SSE 4.2:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erf_double/1     3.00ns ± 1%  3.00ns ± 0%        ~  (p=0.455 n=48+47)
BM_eigen_erf_double/8     34.8ns ± 1%  17.9ns ± 8%  -48.42%  (p=0.000 n=45+59)
BM_eigen_erf_double/64     294ns ± 2%    80ns ±12%  -72.86%  (p=0.000 n=53+59)
BM_eigen_erf_double/512   2.32µs ± 3%  0.57µs ± 6%  -75.45%  (p=0.000 n=55+59)
BM_eigen_erf_double/4k    18.4µs ± 2%   4.5µs ± 6%  -75.59%  (p=0.000 n=50+60)
BM_eigen_erf_double/32k    147µs ± 1%    36µs ± 6%  -75.69%  (p=0.000 n=54+49)
BM_eigen_erf_double/256k  1.18ms ± 2%  0.29ms ± 5%  -75.83%  (p=0.000 n=55+55)
BM_eigen_erf_double/1M    4.76ms ± 3%  1.17ms ±11%  -75.34%  (p=0.000 n=58+58)
AVX2+FMA:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erf_double/1     3.00ns ± 1%  3.54ns ± 1%  +18.14%  (p=0.000 n=47+46)
BM_eigen_erf_double/8     34.8ns ± 1%  20.5ns ± 7%  -41.28%  (p=0.000 n=45+54)
BM_eigen_erf_double/64     295ns ± 3%    65ns ±13%  -78.03%  (p=0.000 n=53+59)
BM_eigen_erf_double/512   2.32µs ± 3%  0.39µs ± 4%  -83.20%  (p=0.000 n=57+47)
BM_eigen_erf_double/4k    18.5µs ± 3%   3.0µs ± 7%  -83.63%  (p=0.000 n=57+53)
BM_eigen_erf_double/32k    148µs ± 3%    24µs ± 3%  -83.54%  (p=0.000 n=58+53)
BM_eigen_erf_double/256k  1.19ms ± 3%  0.21ms ± 4%  -82.05%  (p=0.000 n=57+55)
BM_eigen_erf_double/1M    4.75ms ± 2%  0.87ms ± 8%  -81.69%  (p=0.000 n=56+55)
AVX512:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erf_double/1     3.01ns ± 1%  3.56ns ± 1%  +18.39%  (p=0.000 n=51+47)
BM_eigen_erf_double/8     35.1ns ± 3%  33.3ns ± 1%   -5.34%  (p=0.000 n=46+42)
BM_eigen_erf_double/64     306ns ± 9%    76ns ± 2%  -75.27%  (p=0.000 n=50+60)
BM_eigen_erf_double/512   2.39µs ± 8%  0.35µs ± 3%  -85.17%  (p=0.000 n=55+48)
BM_eigen_erf_double/4k    19.3µs ±12%   2.6µs ± 2%  -86.62%  (p=0.000 n=56+53)
BM_eigen_erf_double/32k    154µs ± 9%    20µs ± 3%  -86.70%  (p=0.000 n=55+60)
BM_eigen_erf_double/256k  1.23ms ± 7%  0.18ms ± 4%  -85.02%  (p=0.000 n=59+57)
BM_eigen_erf_double/1M    4.98ms ±12%  0.74ms ± 3%  -85.12%  (p=0.000 n=58+55)
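
For reading the delta column as a speedup factor, a small sketch may help. It assumes delta is the percent change in CPU time per op (negative means faster), so the speedup is old/new = 1 / (1 + delta/100). The helper name SpeedupFromDeltaPercent is hypothetical and is not part of Eigen or the benchmark.

```cpp
#include <cassert>

// Hypothetical helper: converts a benchmark "delta" (percent change in
// time per op, negative = faster) into a speedup factor old_time / new_time.
double SpeedupFromDeltaPercent(double delta_percent) {
  // A delta of -100% or lower would imply zero or negative run time.
  assert(delta_percent > -100.0);
  // new_time / old_time expressed as a fraction, e.g. -75% -> 0.25.
  const double new_over_old = 1.0 + delta_percent / 100.0;
  return 1.0 / new_over_old;
}
```

Under that assumption, the /512 rows above correspond to roughly 4.1x (SSE 4.2, -75.45%), 6.0x (AVX2+FMA, -83.20%), and 6.7x (AVX512, -85.17%), while the +18.14% delta at size 1 on AVX2+FMA corresponds to roughly 0.85x, i.e. a slowdown.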