optimize new dot product
Reference issue
What does this implement/fix?
Fine-tunes the new dot product with the following changes:
- cheaper bounds calculations
- use branches to avoid vector code when it is not needed
- simpler scalar loop
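Overall, this sacrifices a small amount of performance at mid-range sizes (about 3-5% slower for most sizes between 64 and 1k elements in the benchmark below) in favor of better performance at smaller sizes (up to about 36% faster). The largest sizes are essentially unchanged.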
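The three changes listed above can be illustrated with a minimal, portable sketch. This is not the code from this change: `kPacketSize`, `dot_scalar`, and `dot_sketch` are hypothetical names, and the fixed-width array of partial sums only stands in for a real SIMD accumulator. It assumes a packet width of 8 floats, as with AVX2.

```cpp
#include <cstddef>

// Hypothetical packet width: 8 floats, matching one 256-bit AVX2 register.
constexpr std::size_t kPacketSize = 8;

// Simple scalar loop, used for short inputs and for the tail.
inline float dot_scalar(const float* a, const float* b, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

// Illustrative dot product (an assumption-laden sketch, not Eigen's code).
inline float dot_sketch(const float* a, const float* b, std::size_t n) {
  // Branch: inputs shorter than one packet never enter the vector path.
  if (n < kPacketSize) return dot_scalar(a, b, n);

  // Cheap bound: round n down to a multiple of the packet size with a mask
  // (valid because kPacketSize is a power of two).
  const std::size_t vec_end = n & ~(kPacketSize - 1);

  // Stand-in for a SIMD accumulator: kPacketSize independent partial sums.
  float acc[kPacketSize] = {};
  for (std::size_t i = 0; i < vec_end; i += kPacketSize) {
    for (std::size_t j = 0; j < kPacketSize; ++j) {
      acc[j] += a[i + j] * b[i + j];
    }
  }

  // Horizontal reduction of the partial sums.
  float sum = 0.0f;
  for (std::size_t j = 0; j < kPacketSize; ++j) sum += acc[j];

  // Scalar tail for the remaining n - vec_end elements.
  return sum + dot_scalar(a + vec_end, b + vec_end, n - vec_end);
}
```

The early branch is what keeps tiny inputs cheap: for sizes below one packet, the only work done is the scalar loop, at the cost of one extra comparison for every larger input.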
Additional information
Benchmark relative to the current version, measured with FMA + AVX2 on SkylakeX:
name                      old cpu/op   new cpu/op   delta
BM_eigen_dot<float>/1     1.09ns ± 0%  0.70ns ± 1%  -35.79%  (p=0.000 n=58+55)
BM_eigen_dot<float>/2     2.45ns ± 0%  2.19ns ± 0%  -10.95%  (p=0.000 n=54+54)
BM_eigen_dot<float>/3     3.00ns ± 0%  2.66ns ± 8%  -11.19%  (p=0.000 n=54+55)
BM_eigen_dot<float>/4     3.54ns ± 0%  2.73ns ± 1%  -22.90%  (p=0.000 n=49+52)
BM_eigen_dot<float>/7     5.18ns ± 0%  4.36ns ± 1%  -15.79%  (p=0.000 n=56+53)
BM_eigen_dot<float>/8     2.08ns ± 4%  1.37ns ± 1%  -34.24%  (p=0.000 n=60+55)
BM_eigen_dot<float>/9     2.38ns ± 3%  1.64ns ± 1%  -31.07%  (p=0.000 n=55+54)
BM_eigen_dot<float>/16    2.38ns ± 3%  1.79ns ± 3%  -24.66%  (p=0.000 n=56+57)
BM_eigen_dot<float>/20    3.58ns ± 2%  3.48ns ± 1%   -2.76%  (p=0.000 n=50+47)
BM_eigen_dot<float>/25    2.67ns ± 3%  2.48ns ± 3%   -7.19%  (p=0.000 n=55+50)
BM_eigen_dot<float>/32    2.97ns ± 3%  2.68ns ± 3%   -9.69%  (p=0.000 n=55+54)
BM_eigen_dot<float>/64    3.87ns ± 3%  4.02ns ± 6%   +3.88%  (p=0.000 n=49+50)
BM_eigen_dot<float>/128   7.11ns ± 3%  7.14ns ± 3%     ~     (p=0.109 n=59+60)
BM_eigen_dot<float>/256   12.2ns ± 4%  12.8ns ± 4%   +5.03%  (p=0.000 n=60+58)
BM_eigen_dot<float>/512   24.0ns ± 3%  25.2ns ± 3%   +5.04%  (p=0.000 n=55+54)
BM_eigen_dot<float>/1k    42.8ns ± 3%  44.2ns ± 3%   +3.36%  (p=0.000 n=56+59)
BM_eigen_dot<float>/32k   2.74µs ± 3%  2.74µs ± 4%     ~     (p=0.747 n=55+55)
BM_eigen_dot<float>/512k   157µs ± 2%   156µs ± 3%   -0.61%  (p=0.001 n=58+60)
BM_eigen_dot<float>/1M     314µs ± 2%   313µs ± 3%   -0.42%  (p=0.046 n=50+50)
Edited by Rasmus Munk Larsen