
Vectorize erfc() for float

This adds a vectorized implementation of the complementary error function for float. The implementation is accurate to within 34 ulps over the entire input range and to within 5 ulps for |x| < 1.
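For readers who want to check ulp bounds like these themselves, here is a minimal, standard-library-only sketch of one way to measure them. It is not the test harness used in this merge request: the helper names `ordered_bits`, `ulp_distance`, and `max_ulp_error_erfc` are hypothetical, and the reference value is assumed to be `std::erfc` evaluated in double precision and rounded to float.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: map a float to an integer so that adjacent floats map
// to adjacent integers, and +0.0f and -0.0f both map to 0.
inline std::int64_t ordered_bits(float v) {
  std::int32_t i;
  std::memcpy(&i, &v, sizeof(i));
  // Negative floats are stored as sign-magnitude; mirror them below zero.
  return i < 0
             ? static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::min()) - i
             : static_cast<std::int64_t>(i);
}

// Number of representable floats between a and b (0 if they are equal).
inline std::int64_t ulp_distance(float a, float b) {
  const std::int64_t d = ordered_bits(a) - ordered_bits(b);
  return d < 0 ? -d : d;
}

// Largest ulp distance between candidate(x) and a double-precision reference
// erfc(x), sampled at n evenly spaced points in [lo, hi].
// Assumption: rounding the double result to float is an adequate reference.
template <typename F>
std::int64_t max_ulp_error_erfc(F candidate, float lo, float hi, int n) {
  assert(n >= 2 && lo < hi);
  std::int64_t worst = 0;
  for (int i = 0; i < n; ++i) {
    const float t = static_cast<float>(i) / static_cast<float>(n - 1);
    const float x = lo + (hi - lo) * t;
    const float reference = static_cast<float>(std::erfc(static_cast<double>(x)));
    worst = std::max(worst, ulp_distance(candidate(x), reference));
  }
  return worst;
}
```

Passing a scalar wrapper around the implementation under test as `candidate`, with `lo = -1.0f` and `hi = 1.0f`, gives a sampled estimate for the |x| < 1 range. Evenly spaced sampling can miss the worst case, so the result is a lower bound on the true maximum error rather than a proof of the stated figure.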

Benchmark measurements compared to the old implementation using std::erfc():

SSE 4.2, |x| < 1:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/8     39.5ns ± 0%  21.5ns ± 0%  -45.56%  (p=0.000 n=41+52)
BM_eigen_erfc_float/64     336ns ± 1%   174ns ± 0%  -48.26%  (p=0.000 n=42+51)
BM_eigen_erfc_float/512   2.67µs ± 1%  1.40µs ± 0%  -47.78%  (p=0.000 n=49+50)
BM_eigen_erfc_float/4k    21.4µs ± 1%  11.2µs ± 0%  -47.75%  (p=0.000 n=52+53)
BM_eigen_erfc_float/32k    171µs ± 1%    89µs ± 1%  -47.82%  (p=0.000 n=54+55)
BM_eigen_erfc_float/256k  1.37ms ± 1%  0.71ms ± 0%  -47.81%  (p=0.000 n=49+50)
BM_eigen_erfc_float/1M    5.47ms ± 1%  2.86ms ± 0%  -47.78%  (p=0.000 n=52+45)

SSE 4.2, |x| > 1:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/64    1.42µs ± 1%  0.74µs ± 1%  -48.27%  (p=0.000 n=53+47)
BM_eigen_erfc_float/512   11.5µs ± 1%   5.9µs ± 4%  -48.28%  (p=0.000 n=55+51)
BM_eigen_erfc_float/4k    92.1µs ± 3%  48.9µs ± 9%  -46.89%  (p=0.000 n=51+60)
BM_eigen_erfc_float/32k    739µs ± 2%   389µs ± 9%  -47.30%  (p=0.000 n=51+48)
BM_eigen_erfc_float/256k  5.92ms ± 1%  3.06ms ± 1%  -48.27%  (p=0.000 n=52+42)
BM_eigen_erfc_float/1M    23.7ms ± 1%  12.3ms ± 4%  -47.92%  (p=0.000 n=41+52)

AVX2+FMA, |x| < 1:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/8     39.7ns ± 3%  13.3ns ± 0%  -66.42%  (p=0.000 n=47+51)
BM_eigen_erfc_float/64     337ns ± 2%    92ns ± 0%  -72.68%  (p=0.000 n=41+54)
BM_eigen_erfc_float/512   2.68µs ± 1%  0.73µs ± 0%  -72.88%  (p=0.000 n=49+57)
BM_eigen_erfc_float/4k    21.4µs ± 1%   5.8µs ± 0%  -72.77%  (p=0.000 n=52+54)
BM_eigen_erfc_float/32k    171µs ± 1%    46µs ± 1%  -72.90%  (p=0.000 n=49+56)
BM_eigen_erfc_float/256k  1.37ms ± 1%  0.37ms ± 1%  -72.91%  (p=0.000 n=53+45)
BM_eigen_erfc_float/1M    5.48ms ± 1%  1.49ms ± 1%  -72.84%  (p=0.000 n=54+56)

AVX2+FMA, |x| > 1:
name                      old cpu/op   new cpu/op   delta
BM_eigen_erfc_float/8      164ns ± 3%    65ns ± 4%  -60.55%  (p=0.000 n=48+57)
BM_eigen_erfc_float/64    1.42µs ± 1%  0.55µs ± 3%  -61.29%  (p=0.000 n=47+58)
BM_eigen_erfc_float/512   11.5µs ± 2%   4.4µs ± 5%  -61.46%  (p=0.000 n=47+60)
BM_eigen_erfc_float/4k    91.9µs ± 1%  35.4µs ± 3%  -61.49%  (p=0.000 n=46+49)
BM_eigen_erfc_float/32k    738µs ± 2%   284µs ± 3%  -61.48%  (p=0.000 n=47+54)
BM_eigen_erfc_float/256k  5.92ms ± 4%  2.28ms ± 3%  -61.54%  (p=0.000 n=48+60)
BM_eigen_erfc_float/1M    23.8ms ± 6%   9.1ms ± 3%  -61.72%  (p=0.000 n=46+60)
Edited by Rasmus Munk Larsen
