Vectorize erfc(x) for double and improve erfc(x) for float.

The implementation for double has a maximum relative error of 7 ulps, and this change improves the maximum relative error for float to 5 ulps.
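For readers unfamiliar with the metric, the following is a minimal sketch of how a maximum error in ulps could be estimated for a float erfc. It is not the test harness used for this change: it assumes double-precision `std::erfc` is accurate enough to serve as the reference, it checks float `std::erfc` rather than Eigen's packet implementation, and the helper names `ulp_of_float`, `ulp_error`, and `max_erfc_ulp_error` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Size of one float ulp at |ref|: the gap between the float nearest to |ref|
// and the next float above it. Hypothetical helper, not part of Eigen.
inline double ulp_of_float(double ref) {
  const float r = static_cast<float>(std::fabs(ref));
  const float next = std::nextafter(r, std::numeric_limits<float>::infinity());
  return static_cast<double>(next) - static_cast<double>(r);
}

// Error of a float result, in float ulps, against a double reference value.
inline double ulp_error(float got, double ref) {
  return std::fabs(static_cast<double>(got) - ref) / ulp_of_float(ref);
}

// Scans [lo, hi] at steps + 1 evenly spaced points and returns the largest
// ulp error of float std::erfc measured against double std::erfc.
inline double max_erfc_ulp_error(float lo, float hi, int steps) {
  assert(steps > 0);
  double worst = 0.0;
  for (int i = 0; i <= steps; ++i) {
    const float x =
        lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(steps);
    const double ref = std::erfc(static_cast<double>(x));
    const double err = ulp_error(std::erfc(x), ref);
    if (err > worst) worst = err;
  }
  return worst;
}
```

Calling `max_erfc_ulp_error(-4.0f, 4.0f, 1000)` gives a rough estimate over that interval. A uniform scan like this can miss the worst-case inputs, so the result is a lower bound on the true maximum error.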

This change also fixes a bug in !1710 (merged), which implemented the vectorized version of erfc for float but never enabled it (duh!). As a result, the speedups reported below are much larger than those in that MR description, even though accuracy has improved as well.
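As a generic illustration of how a vectorized kernel can be implemented yet never run, here is a minimal sketch of trait-based dispatch. It is not Eigen's actual code: `toy_traits`, `has_vectorized_erfc`, `erfc_block4`, and `erfc_array` are hypothetical names, and the "SIMD" kernel is a plain loop standing in for real packet code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical traits struct: each scalar type advertises whether the
// vectorized erfc kernel should be used. The default is "off".
template <typename T>
struct toy_traits {
  static constexpr bool has_vectorized_erfc = false;
};

// Opting float in. Without this specialization the kernel below would still
// compile, but the dispatcher would never call it for float.
template <>
struct toy_traits<float> {
  static constexpr bool has_vectorized_erfc = true;
};

enum class Path { kScalar, kVectorized };

// Stand-in for a SIMD kernel: processes 4 elements per call.
template <typename T>
void erfc_block4(const T* in, T* out) {
  for (int k = 0; k < 4; ++k) out[k] = std::erfc(in[k]);
}

// Computes erfc over an array and returns which path the traits selected.
template <typename T>
Path erfc_array(const T* in, T* out, std::size_t n) {
  assert(in != nullptr && out != nullptr);
  std::size_t i = 0;
  Path path = Path::kScalar;
  if (toy_traits<T>::has_vectorized_erfc) {
    for (; i + 4 <= n; i += 4) erfc_block4(in + i, out + i);
    path = Path::kVectorized;
  }
  // Scalar path, also used for the remainder after the blocked loop.
  for (; i < n; ++i) out[i] = std::erfc(in[i]);
  return path;
}
```

In this sketch both paths produce the same values, so a missing opt-in shows up only as lost performance. That is why such a bug can slip past correctness tests.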

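Benchmark numbers were measured against the current master branch with clang 18 on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz. The delta column is the relative change in cpu/op, so negative values mean the new code is faster: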

SSE 4.2:
name                       old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/1      2.46ns ± 0%  3.00ns ± 0%  +21.92%  (p=0.000 n=46+49)
BM_eigen_erfc_float/8      21.5ns ± 2%  25.7ns ± 0%  +19.63%  (p=0.000 n=51+47)
BM_eigen_erfc_float/64      174ns ± 0%    53ns ± 6%  -69.71%  (p=0.000 n=53+60)
BM_eigen_erfc_float/512    1.40µs ± 1%  0.26µs ± 4%  -81.43%  (p=0.000 n=59+55)
BM_eigen_erfc_float/4k     11.2µs ± 1%   1.9µs ± 6%  -82.70%  (p=0.000 n=58+60)
BM_eigen_erfc_float/32k    89.3µs ± 1%  15.1µs ± 4%  -83.06%  (p=0.000 n=59+60)
BM_eigen_erfc_float/256k    714µs ± 0%   123µs ± 5%  -82.80%  (p=0.000 n=51+60)
BM_eigen_erfc_float/1M     2.86ms ± 0%  0.49ms ± 7%  -82.76%  (p=0.000 n=50+60)
BM_eigen_erfc_double/1     3.00ns ± 0%  3.00ns ± 1%     ~     (p=0.580 n=50+50)
BM_eigen_erfc_double/8     36.7ns ± 2%  27.1ns ± 5%  -26.06%  (p=0.000 n=49+54)
BM_eigen_erfc_double/64     331ns ± 4%   120ns ±10%  -63.77%  (p=0.000 n=44+60)
BM_eigen_erfc_double/512   2.69µs ± 4%  0.83µs ± 6%  -69.28%  (p=0.000 n=51+59)
BM_eigen_erfc_double/4k    21.7µs ± 3%   6.5µs ± 6%  -69.90%  (p=0.000 n=57+58)
BM_eigen_erfc_double/32k    174µs ± 4%    52µs ± 6%  -70.13%  (p=0.000 n=59+59)
BM_eigen_erfc_double/256k  1.38ms ± 3%  0.42ms ± 5%  -69.77%  (p=0.000 n=56+52)
BM_eigen_erfc_double/1M    5.33ms ± 5%  1.68ms ± 7%  -68.51%  (p=0.000 n=60+55)


AVX2 + FMA:
name                       old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/1      3.27ns ± 1%  2.73ns ± 0%  -16.60%  (p=0.000 n=46+51)
BM_eigen_erfc_float/8      13.3ns ± 1%  16.1ns ± 1%  +20.90%  (p=0.000 n=52+56)
BM_eigen_erfc_float/64     92.0ns ± 0%  34.1ns ± 8%  -63.00%  (p=0.000 n=54+48)
BM_eigen_erfc_float/512     726ns ± 1%   182ns ± 4%  -74.94%  (p=0.000 n=57+53)
BM_eigen_erfc_float/4k     5.83µs ± 0%  1.39µs ± 2%  -76.13%  (p=0.000 n=56+58)
BM_eigen_erfc_float/32k    46.3µs ± 1%  11.0µs ± 3%  -76.28%  (p=0.000 n=58+59)
BM_eigen_erfc_float/256k    371µs ± 1%    97µs ± 4%  -73.80%  (p=0.000 n=49+58)
BM_eigen_erfc_float/1M     1.49ms ± 0%  0.39ms ± 3%  -73.77%  (p=0.000 n=55+46)
BM_eigen_erfc_double/1     3.00ns ± 1%  2.73ns ± 1%   -9.00%  (p=0.000 n=46+50)
BM_eigen_erfc_double/8     37.3ns ± 2%  14.9ns ± 3%  -60.17%  (p=0.000 n=39+51)
BM_eigen_erfc_double/64     326ns ± 8%    85ns ± 9%  -73.94%  (p=0.000 n=50+59)
BM_eigen_erfc_double/512   2.58µs ± 9%  0.61µs ± 5%  -76.39%  (p=0.000 n=55+52)
BM_eigen_erfc_double/4k    20.1µs ± 2%   4.8µs ± 4%  -76.07%  (p=0.000 n=46+53)
BM_eigen_erfc_double/32k    160µs ± 8%    38µs ± 3%  -76.20%  (p=0.000 n=48+44)
BM_eigen_erfc_double/256k  1.27ms ± 6%  0.31ms ± 5%  -75.98%  (p=0.000 n=47+43)
BM_eigen_erfc_double/1M    5.11ms ± 5%  1.26ms ± 7%  -75.29%  (p=0.000 n=48+60)

AVX512:
name                       old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/1      2.83ns ± 4%  3.27ns ± 1%  +15.60%  (p=0.000 n=55+47)
BM_eigen_erfc_float/8      13.0ns ± 0%  15.9ns ± 0%  +22.12%  (p=0.000 n=51+57)
BM_eigen_erfc_float/64     92.1ns ± 1%  38.4ns ± 9%  -58.30%  (p=0.000 n=51+50)
BM_eigen_erfc_float/512     730ns ± 2%   100ns ± 6%  -86.32%  (p=0.000 n=58+59)
BM_eigen_erfc_float/4k     5.80µs ± 1%  0.62µs ± 5%  -89.25%  (p=0.000 n=58+59)
BM_eigen_erfc_float/32k    46.7µs ± 1%   5.5µs ± 7%  -88.23%  (p=0.000 n=57+60)
BM_eigen_erfc_float/256k    374µs ± 0%    81µs ± 3%  -78.31%  (p=0.000 n=41+59)
BM_eigen_erfc_float/1M     1.49ms ± 1%  0.32ms ± 3%  -78.28%  (p=0.000 n=51+49)
BM_eigen_erfc_double/1     3.00ns ± 0%  3.00ns ± 0%     ~     (p=0.441 n=52+48)
BM_eigen_erfc_double/8     37.8ns ± 1%  21.1ns ± 0%  -44.27%  (p=0.000 n=43+49)
BM_eigen_erfc_double/64     325ns ± 3%    66ns ±10%  -79.62%  (p=0.000 n=45+58)
BM_eigen_erfc_double/512   2.57µs ± 2%  0.38µs ± 7%  -85.17%  (p=0.000 n=48+50)
BM_eigen_erfc_double/4k    20.6µs ± 1%   3.0µs ± 7%  -85.37%  (p=0.000 n=49+53)
BM_eigen_erfc_double/32k    165µs ± 0%    24µs ± 8%  -85.52%  (p=0.000 n=46+55)
BM_eigen_erfc_double/256k  1.32ms ± 1%  0.22ms ± 5%  -83.58%  (p=0.000 n=49+57)
BM_eigen_erfc_double/1M    5.36ms ± 4%  0.89ms ±10%  -83.46%  (p=0.000 n=58+60)
Edited by Rasmus Munk Larsen
