optimize new dot product

Reference issue

What does this implement/fix?

Fine-tunes the new dot product with the following changes:

  • cheaper bounds calculations
  • use branches to avoid vector code if it is not needed
  • simpler scalar loop

Overall, this sacrifices a tiny bit of performance for humongous sizes in favor of better performance for smaller sizes.
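
As a rough illustration of the three changes listed above, here is a minimal, portable sketch. It is not the code in this merge request: the names `kLanes`, `dot_scalar`, and `dot_sketch` are hypothetical, the lane count of 8 assumes one AVX2 register of floats, and the blocked loop relies on the compiler to vectorize rather than using Eigen's packet primitives.

```cpp
#include <cstddef>

// Hypothetical lane count: 8 floats, as in one AVX2 register.
constexpr std::size_t kLanes = 8;

// Plain scalar loop, used for short inputs and for the tail.
inline float dot_scalar(const float* a, const float* b, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

// Illustrative only; not Eigen's implementation.
inline float dot_sketch(const float* a, const float* b, std::size_t n) {
  // Branch early so short inputs never touch the blocked path.
  if (n < kLanes) return dot_scalar(a, b, n);

  // Cheap bound: round n down to a multiple of kLanes with a mask
  // (valid because kLanes is a power of two).
  const std::size_t vec_end = n & ~(kLanes - 1);

  // One partial sum per lane; a compiler may map this to SIMD.
  float acc[kLanes] = {};
  for (std::size_t i = 0; i < vec_end; i += kLanes)
    for (std::size_t l = 0; l < kLanes; ++l) acc[l] += a[i + l] * b[i + l];

  // Reduce the per-lane partial sums to a single value.
  float sum = 0.0f;
  for (std::size_t l = 0; l < kLanes; ++l) sum += acc[l];

  // The remaining n - vec_end elements go through the scalar loop.
  return sum + dot_scalar(a + vec_end, b + vec_end, n - vec_end);
}
```

The early return is what keeps the smallest sizes cheap in this sketch: inputs shorter than one register's worth of floats pay only for the scalar loop and a single comparison.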

Additional information

Benchmark relative to the current version, measured with FMA + AVX2 on SkylakeX:

name                       old cpu/op   new cpu/op   delta
BM_eigen_dot<float>/1      1.09ns ± 0%  0.70ns ± 1%  -35.79%  (p=0.000 n=58+55)
BM_eigen_dot<float>/2      2.45ns ± 0%  2.19ns ± 0%  -10.95%  (p=0.000 n=54+54)
BM_eigen_dot<float>/3      3.00ns ± 0%  2.66ns ± 8%  -11.19%  (p=0.000 n=54+55)
BM_eigen_dot<float>/4      3.54ns ± 0%  2.73ns ± 1%  -22.90%  (p=0.000 n=49+52)
BM_eigen_dot<float>/7      5.18ns ± 0%  4.36ns ± 1%  -15.79%  (p=0.000 n=56+53)
BM_eigen_dot<float>/8      2.08ns ± 4%  1.37ns ± 1%  -34.24%  (p=0.000 n=60+55)
BM_eigen_dot<float>/9      2.38ns ± 3%  1.64ns ± 1%  -31.07%  (p=0.000 n=55+54)
BM_eigen_dot<float>/16     2.38ns ± 3%  1.79ns ± 3%  -24.66%  (p=0.000 n=56+57)
BM_eigen_dot<float>/20     3.58ns ± 2%  3.48ns ± 1%   -2.76%  (p=0.000 n=50+47)
BM_eigen_dot<float>/25     2.67ns ± 3%  2.48ns ± 3%   -7.19%  (p=0.000 n=55+50)
BM_eigen_dot<float>/32     2.97ns ± 3%  2.68ns ± 3%   -9.69%  (p=0.000 n=55+54)
BM_eigen_dot<float>/64     3.87ns ± 3%  4.02ns ± 6%   +3.88%  (p=0.000 n=49+50)
BM_eigen_dot<float>/128    7.11ns ± 3%  7.14ns ± 3%     ~     (p=0.109 n=59+60)
BM_eigen_dot<float>/256    12.2ns ± 4%  12.8ns ± 4%   +5.03%  (p=0.000 n=60+58)
BM_eigen_dot<float>/512    24.0ns ± 3%  25.2ns ± 3%   +5.04%  (p=0.000 n=55+54)
BM_eigen_dot<float>/1k     42.8ns ± 3%  44.2ns ± 3%   +3.36%  (p=0.000 n=56+59)
BM_eigen_dot<float>/32k    2.74µs ± 3%  2.74µs ± 4%     ~     (p=0.747 n=55+55)
BM_eigen_dot<float>/512k    157µs ± 2%   156µs ± 3%   -0.61%  (p=0.001 n=58+60)
BM_eigen_dot<float>/1M      314µs ± 2%   313µs ± 3%   -0.42%  (p=0.046 n=50+50)
Edited by Rasmus Munk Larsen
