Improve speed and accuracy of erf()

This reduces the maximum error from 4 to 3 ulps and yields the following speedups:

AVX2+FMA:
name                     old cpu/op   new cpu/op   delta
BM_eigen_erf_float/64    46.1ns ± 6%  34.8ns ± 4%  -24.40%  (p=0.000 n=58+46)
BM_eigen_erf_float/512    289ns ± 6%   191ns ± 4%  -33.80%  (p=0.000 n=48+59)
BM_eigen_erf_float/4k    2.24µs ± 4%  1.44µs ± 4%  -35.49%  (p=0.000 n=51+59)
BM_eigen_erf_float/32k   17.8µs ± 5%  11.5µs ± 5%  -35.27%  (p=0.000 n=50+60)
BM_eigen_erf_float/256k   142µs ± 6%   101µs ± 4%  -28.80%  (p=0.000 n=55+59)
BM_eigen_erf_float/1M     567µs ± 4%   404µs ± 4%  -28.81%  (p=0.000 n=54+47)

SSE 4.2:
name                     old cpu/op   new cpu/op   delta
BM_eigen_erf_float/64    64.5ns ± 6%  55.6ns ± 6%  -13.72%  (p=0.000 n=56+58)
BM_eigen_erf_float/512    381ns ± 4%   333ns ± 3%  -12.56%  (p=0.000 n=47+49)
BM_eigen_erf_float/4k    2.93µs ± 5%  2.56µs ± 3%  -12.80%  (p=0.000 n=53+54)
BM_eigen_erf_float/32k   23.4µs ± 6%  20.3µs ± 4%  -13.15%  (p=0.000 n=55+58)
BM_eigen_erf_float/256k   187µs ± 6%   163µs ± 3%  -12.94%  (p=0.000 n=56+60)
BM_eigen_erf_float/1M     742µs ± 6%   652µs ± 3%  -12.14%  (p=0.000 n=59+59)
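
For readers who want to sanity-check an ulp figure like the one quoted above, here is a minimal sketch of one way to estimate the maximum ulp error of a float erf implementation. It is not the test used for this change. The helper names (ordered_bits, ulp_distance, max_ulp_error) are hypothetical, the reference is assumed to be double-precision std::erf rounded to float, and uniform sampling only gives a lower bound on the true maximum.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Map a float's bit pattern to an integer that is monotonic in the float's
// value, so that the ULP distance between two floats is a plain subtraction.
// (Hypothetical helper; assumes IEEE-754 binary32 and finite inputs.)
static std::int64_t ordered_bits(float x) {
  std::int32_t i;
  std::memcpy(&i, &x, sizeof(i));
  // Negative floats are stored sign-magnitude; flip them so that
  // -0.0f maps to 0 and more negative values map to smaller integers.
  return i < 0 ? static_cast<std::int64_t>(INT32_MIN) - i
               : static_cast<std::int64_t>(i);
}

// Number of representable floats between a and b (0 if they are equal).
static std::int64_t ulp_distance(float a, float b) {
  return std::llabs(ordered_bits(a) - ordered_bits(b));
}

// Largest ULP distance between f(x) and the rounded double-precision
// reference over `samples` evenly spaced points in [lo, hi].
// Assumes samples >= 2. A sampled maximum is only a lower bound on the
// true worst case; an exhaustive sweep over all floats would be stricter.
template <typename F>
std::int64_t max_ulp_error(F f, float lo, float hi, int samples) {
  std::int64_t worst = 0;
  for (int k = 0; k < samples; ++k) {
    const float t = static_cast<float>(k) / static_cast<float>(samples - 1);
    const float x = lo + (hi - lo) * t;
    const float ref = static_cast<float>(std::erf(static_cast<double>(x)));
    worst = std::max(worst, ulp_distance(f(x), ref));
  }
  return worst;
}
```

Calling max_ulp_error with a wrapper around the implementation under test, for example over [-4, 4], gives an estimate that can be compared against the figure above. Because the reference is itself rounded to float, the result is an integer approximation of the error in ulps rather than an exact measurement.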