Vectorize erfc(x) for double and improve erfc(x) for float.
The new implementation for double has a maximum relative error of 7 ulps, and this change reduces the maximum relative error for float to 5 ulps.
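For readers who want to reproduce this kind of ulp figure, here is a minimal sketch of one way to measure it. It is not the test code from this MR: `UlpError` and `MaxUlpError` are hypothetical helper names, the reference value is `std::erfc` evaluated in double, and the function under test is passed in by the caller. It only covers the float case, since a double kernel would need a higher-precision reference.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: error of a float result, measured in units of the
// float spacing (ulp) at the correctly rounded reference value.
// Assumes the reference is finite and well below FLT_MAX (true for erfc).
inline double UlpError(float computed, double reference) {
  const float ref_f = static_cast<float>(reference);
  if (computed == ref_f) return 0.0;
  if (std::isnan(computed) || std::isnan(ref_f) ||
      std::isinf(computed) || std::isinf(ref_f)) {
    return (std::isnan(computed) && std::isnan(ref_f))
               ? 0.0
               : std::numeric_limits<double>::infinity();
  }
  // Spacing between |ref_f| and the next representable float above it.
  const float mag = std::fabs(ref_f);
  const double ulp =
      static_cast<double>(
          std::nextafter(mag, std::numeric_limits<float>::infinity())) -
      static_cast<double>(mag);
  return std::fabs(static_cast<double>(computed) - reference) / ulp;
}

// Hypothetical helper: worst ulp error of `f` against double-precision
// std::erfc over `samples` evenly spaced points in [lo, hi].
template <typename F>
double MaxUlpError(F f, float lo, float hi, int samples) {
  assert(samples >= 2);
  double worst = 0.0;
  for (int i = 0; i < samples; ++i) {
    const float t = static_cast<float>(i) / static_cast<float>(samples - 1);
    const float x = lo + (hi - lo) * t;
    const double reference = std::erfc(static_cast<double>(x));
    worst = std::max(worst, UlpError(f(x), reference));
  }
  return worst;
}
```

Calling `MaxUlpError([](float x) { return std::erfc(x); }, -4.0f, 4.0f, 100000)` would report the worst-case error of the standard library's float `erfc` on that interval. Evenly spaced sampling can miss the true worst case, so a figure obtained this way is only a lower bound on the maximum error.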
This change also fixes a bug in !1710 (merged), which implemented the vectorized version of erfc for float but never enabled it. As a result, the speedups below are much larger than those reported in that MR's description, even though accuracy has also improved.
Benchmark numbers (old = current master branch, new = this change), measured with clang 18 on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz:
SSE 4.2:
name                        old cpu/op   new cpu/op    delta
BM_eigen_erfc_float/1      2.46ns ± 0%  3.00ns ± 0%  +21.92%  (p=0.000 n=46+49)
BM_eigen_erfc_float/8      21.5ns ± 2%  25.7ns ± 0%  +19.63%  (p=0.000 n=51+47)
BM_eigen_erfc_float/64      174ns ± 0%    53ns ± 6%  -69.71%  (p=0.000 n=53+60)
BM_eigen_erfc_float/512    1.40µs ± 1%  0.26µs ± 4%  -81.43%  (p=0.000 n=59+55)
BM_eigen_erfc_float/4k     11.2µs ± 1%   1.9µs ± 6%  -82.70%  (p=0.000 n=58+60)
BM_eigen_erfc_float/32k    89.3µs ± 1%  15.1µs ± 4%  -83.06%  (p=0.000 n=59+60)
BM_eigen_erfc_float/256k    714µs ± 0%   123µs ± 5%  -82.80%  (p=0.000 n=51+60)
BM_eigen_erfc_float/1M     2.86ms ± 0%  0.49ms ± 7%  -82.76%  (p=0.000 n=50+60)
BM_eigen_erfc_double/1     3.00ns ± 0%  3.00ns ± 1%        ~  (p=0.580 n=50+50)
BM_eigen_erfc_double/8     36.7ns ± 2%  27.1ns ± 5%  -26.06%  (p=0.000 n=49+54)
BM_eigen_erfc_double/64     331ns ± 4%   120ns ±10%  -63.77%  (p=0.000 n=44+60)
BM_eigen_erfc_double/512   2.69µs ± 4%  0.83µs ± 6%  -69.28%  (p=0.000 n=51+59)
BM_eigen_erfc_double/4k    21.7µs ± 3%   6.5µs ± 6%  -69.90%  (p=0.000 n=57+58)
BM_eigen_erfc_double/32k    174µs ± 4%    52µs ± 6%  -70.13%  (p=0.000 n=59+59)
BM_eigen_erfc_double/256k  1.38ms ± 3%  0.42ms ± 5%  -69.77%  (p=0.000 n=56+52)
BM_eigen_erfc_double/1M    5.33ms ± 5%  1.68ms ± 7%  -68.51%  (p=0.000 n=60+55)
AVX2 + FMA:
name                        old cpu/op   new cpu/op    delta
BM_eigen_erfc_float/1      3.27ns ± 1%  2.73ns ± 0%  -16.60%  (p=0.000 n=46+51)
BM_eigen_erfc_float/8      13.3ns ± 1%  16.1ns ± 1%  +20.90%  (p=0.000 n=52+56)
BM_eigen_erfc_float/64     92.0ns ± 0%  34.1ns ± 8%  -63.00%  (p=0.000 n=54+48)
BM_eigen_erfc_float/512     726ns ± 1%   182ns ± 4%  -74.94%  (p=0.000 n=57+53)
BM_eigen_erfc_float/4k     5.83µs ± 0%  1.39µs ± 2%  -76.13%  (p=0.000 n=56+58)
BM_eigen_erfc_float/32k    46.3µs ± 1%  11.0µs ± 3%  -76.28%  (p=0.000 n=58+59)
BM_eigen_erfc_float/256k    371µs ± 1%    97µs ± 4%  -73.80%  (p=0.000 n=49+58)
BM_eigen_erfc_float/1M     1.49ms ± 0%  0.39ms ± 3%  -73.77%  (p=0.000 n=55+46)
BM_eigen_erfc_double/1     3.00ns ± 1%  2.73ns ± 1%   -9.00%  (p=0.000 n=46+50)
BM_eigen_erfc_double/8     37.3ns ± 2%  14.9ns ± 3%  -60.17%  (p=0.000 n=39+51)
BM_eigen_erfc_double/64     326ns ± 8%    85ns ± 9%  -73.94%  (p=0.000 n=50+59)
BM_eigen_erfc_double/512   2.58µs ± 9%  0.61µs ± 5%  -76.39%  (p=0.000 n=55+52)
BM_eigen_erfc_double/4k    20.1µs ± 2%   4.8µs ± 4%  -76.07%  (p=0.000 n=46+53)
BM_eigen_erfc_double/32k    160µs ± 8%    38µs ± 3%  -76.20%  (p=0.000 n=48+44)
BM_eigen_erfc_double/256k  1.27ms ± 6%  0.31ms ± 5%  -75.98%  (p=0.000 n=47+43)
BM_eigen_erfc_double/1M    5.11ms ± 5%  1.26ms ± 7%  -75.29%  (p=0.000 n=48+60)
AVX512:
name                        old cpu/op   new cpu/op    delta
BM_eigen_erfc_float/1      2.83ns ± 4%  3.27ns ± 1%  +15.60%  (p=0.000 n=55+47)
BM_eigen_erfc_float/8      13.0ns ± 0%  15.9ns ± 0%  +22.12%  (p=0.000 n=51+57)
BM_eigen_erfc_float/64     92.1ns ± 1%  38.4ns ± 9%  -58.30%  (p=0.000 n=51+50)
BM_eigen_erfc_float/512     730ns ± 2%   100ns ± 6%  -86.32%  (p=0.000 n=58+59)
BM_eigen_erfc_float/4k     5.80µs ± 1%  0.62µs ± 5%  -89.25%  (p=0.000 n=58+59)
BM_eigen_erfc_float/32k    46.7µs ± 1%   5.5µs ± 7%  -88.23%  (p=0.000 n=57+60)
BM_eigen_erfc_float/256k    374µs ± 0%    81µs ± 3%  -78.31%  (p=0.000 n=41+59)
BM_eigen_erfc_float/1M     1.49ms ± 1%  0.32ms ± 3%  -78.28%  (p=0.000 n=51+49)
BM_eigen_erfc_double/1     3.00ns ± 0%  3.00ns ± 0%        ~  (p=0.441 n=52+48)
BM_eigen_erfc_double/8     37.8ns ± 1%  21.1ns ± 0%  -44.27%  (p=0.000 n=43+49)
BM_eigen_erfc_double/64     325ns ± 3%    66ns ±10%  -79.62%  (p=0.000 n=45+58)
BM_eigen_erfc_double/512   2.57µs ± 2%  0.38µs ± 7%  -85.17%  (p=0.000 n=48+50)
BM_eigen_erfc_double/4k    20.6µs ± 1%   3.0µs ± 7%  -85.37%  (p=0.000 n=49+53)
BM_eigen_erfc_double/32k    165µs ± 0%    24µs ± 8%  -85.52%  (p=0.000 n=46+55)
BM_eigen_erfc_double/256k  1.32ms ± 1%  0.22ms ± 5%  -83.58%  (p=0.000 n=49+57)
BM_eigen_erfc_double/1M    5.36ms ± 4%  0.89ms ±10%  -83.46%  (p=0.000 n=58+60)
Edited by Rasmus Munk Larsen