Improve speed and accuracy of erf()

This reduces the maximum error from 4 ulps to 3 ulps and yields the following speedups:

AVX2+FMA:

name                      old cpu/op   new cpu/op   delta
BM_eigen_erf_float/64     46.1ns ± 6%  34.8ns ± 4%  -24.40%  (p=0.000 n=58+46)
BM_eigen_erf_float/512     289ns ± 6%   191ns ± 4%  -33.80%  (p=0.000 n=48+59)
BM_eigen_erf_float/4k     2.24µs ± 4%  1.44µs ± 4%  -35.49%  (p=0.000 n=51+59)
BM_eigen_erf_float/32k    17.8µs ± 5%  11.5µs ± 5%  -35.27%  (p=0.000 n=50+60)
BM_eigen_erf_float/256k    142µs ± 6%   101µs ± 4%  -28.80%  (p=0.000 n=55+59)
BM_eigen_erf_float/1M      567µs ± 4%   404µs ± 4%  -28.81%  (p=0.000 n=54+47)

SSE 4.2:

name                      old cpu/op   new cpu/op   delta
BM_eigen_erf_float/64     64.5ns ± 6%  55.6ns ± 6%  -13.72%  (p=0.000 n=56+58)
BM_eigen_erf_float/512     381ns ± 4%   333ns ± 3%  -12.56%  (p=0.000 n=47+49)
BM_eigen_erf_float/4k     2.93µs ± 5%  2.56µs ± 3%  -12.80%  (p=0.000 n=53+54)
BM_eigen_erf_float/32k    23.4µs ± 6%  20.3µs ± 4%  -13.15%  (p=0.000 n=55+58)
BM_eigen_erf_float/256k    187µs ± 6%   163µs ± 3%  -12.94%  (p=0.000 n=56+60)
BM_eigen_erf_float/1M      742µs ± 6%   652µs ± 3%  -12.14%  (p=0.000 n=59+59)
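
For readers who want to see what an error bound "in ulps" means in practice, here is a minimal sketch of how such a figure could be measured. It is not the test harness used for this change: the helper names `UlpOf`, `UlpError`, and `MaxUlpError` are hypothetical, the reference value is assumed to be `std::erf` evaluated in double precision, and the sweep uses evenly spaced points rather than every representable float.

```cpp
#include <cmath>
#include <limits>

// Size of one unit in the last place (ulp) of a float near `x`:
// the gap between |x| and the next representable float above it.
// (Hypothetical helper, not part of Eigen.)
inline float UlpOf(float x) {
  const float ax = std::fabs(x);
  return std::nextafter(ax, std::numeric_limits<float>::infinity()) - ax;
}

// Error of `approx` against a double-precision reference, in float ulps.
inline double UlpError(float approx, double reference) {
  const float ref_f = static_cast<float>(reference);
  return std::fabs(static_cast<double>(approx) - reference) /
         static_cast<double>(UlpOf(ref_f));
}

// Largest ulp error of `f` over `n` evenly spaced points in [lo, hi],
// taking std::erf in double precision as the assumed reference.
template <typename F>
double MaxUlpError(F f, float lo, float hi, int n) {
  double worst = 0.0;
  for (int i = 0; i < n; ++i) {
    const float x =
        lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(n - 1);
    const double err = UlpError(f(x), std::erf(static_cast<double>(x)));
    if (err > worst) worst = err;
  }
  return worst;
}
```

Calling `MaxUlpError` with a callable that wraps the float implementation under test (for example a lambda that forwards to it) over a range such as [-4, 4] gives an estimate of the worst-case error on the sampled points. A sampled sweep can miss the true maximum, so a bound like "3 ulps" would need an exhaustive or much denser sweep to confirm.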
