Make sure we return +/-1 above the clamping point for Erf().
This also gives a small speedup in some cases. The numbers below were measured with AVX2 on Skylake, compiled with -march=skylake:
name                     old cpu/op   new cpu/op   delta
BM_eigen_erf_float/1     1.10ns ± 0%  1.09ns ± 0%   -0.43%  (p=0.000 n=55+57)
BM_eigen_erf_float/8     13.9ns ± 1%  12.5ns ± 6%  -10.05%  (p=0.000 n=48+60)
BM_eigen_erf_float/64    38.9ns ± 6%  36.4ns ± 3%   -6.31%  (p=0.000 n=46+42)
BM_eigen_erf_float/512    231ns ± 3%   221ns ± 4%   -4.17%  (p=0.000 n=52+47)
BM_eigen_erf_float/4k    1.80µs ± 3%  1.73µs ± 5%   -3.55%  (p=0.000 n=58+53)
BM_eigen_erf_float/32k   14.2µs ± 3%  13.8µs ± 7%   -3.33%  (p=0.000 n=51+54)
BM_eigen_erf_float/256k   117µs ± 5%   115µs ± 5%   -1.76%  (p=0.000 n=59+57)
BM_eigen_erf_float/1M     470µs ± 3%   463µs ± 6%   -1.47%  (p=0.000 n=58+60)
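
For illustration only, the intended behavior at the clamping point can be sketched in scalar C++ roughly as follows. This is not the vectorized Eigen kernel: the names kErfClamp, erf_approx_inside and erf_clamped are hypothetical, the clamp value of 4.0f is an assumption, and std::erf stands in for the actual approximation used inside the clamped range.

```cpp
#include <cmath>

// Hypothetical clamping point; the value used by the real kernel may differ.
constexpr float kErfClamp = 4.0f;

// Stand-in for the approximation that is only trusted for |x| < kErfClamp.
// The real kernel uses its own approximation; std::erf is used here
// purely so the sketch is runnable.
inline float erf_approx_inside(float x) { return std::erf(x); }

// Returns +/-1 at or beyond the clamping point, otherwise the approximation.
inline float erf_clamped(float x) {
  if (std::isnan(x)) return x;  // propagate NaN unchanged
  if (std::fabs(x) >= kErfClamp) {
    // Outside the trusted range: saturate to +1 or -1 with the sign of x.
    return std::copysign(1.0f, x);
  }
  return erf_approx_inside(x);
}
```

With this shape, inputs such as 10.0f, -10.0f and +/-infinity saturate to +/-1 instead of depending on how the approximation behaves outside its fitted range.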