Better dot products
Reference issue
What does this implement/fix?
Currently, inner products are implemented as a compound expression, i.e. a.conjugate().cwiseProduct(b).sum(). This works, but it precludes the use of fused multiply-add (FMA) instructions, which are ideal for inner products. This MR adds a new inner product evaluator that performs the reduction directly on the two operands.
The new evaluator is used for the dot product (conjugated inner product) invoked by dot(), and the non-conjugated inner product invoked by a matrix product that results in a 1x1 scalar.
MatrixXf a(10,10);
float conjugated_inner_product = a.row(4).dot(a.col(4));
float non_conjugated_inner_product = a.row(4) * a.col(4);
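To illustrate the difference between the compound form and a fused reduction, here is a minimal scalar sketch in plain C++. It is not the evaluator added by this MR: the function names (dot_compound, dot_fused, dot_conjugated) are hypothetical, and std::fma stands in for the packet-level FMA instructions the real evaluator would use. The last function shows the conjugated convention used by dot(), where the first operand is conjugated.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Compound form: the product is computed first, then added to the sum,
// mirroring cwiseProduct(b).sum(). (Hypothetical helper, for illustration.)
float dot_compound(const std::vector<float>& a, const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float prod = a[i] * b[i];
    sum += prod;
  }
  return sum;
}

// Fused form: std::fma computes a[i] * b[i] + sum as a single operation,
// which maps to one FMA instruction where the hardware provides it.
float dot_fused(const std::vector<float>& a, const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum = std::fma(a[i], b[i], sum);
  }
  return sum;
}

// Conjugated inner product: the first operand is conjugated, matching
// a.conjugate().cwiseProduct(b).sum() for complex scalars.
std::complex<float> dot_conjugated(const std::vector<std::complex<float>>& a,
                                   const std::vector<std::complex<float>>& b) {
  std::complex<float> sum(0.0f, 0.0f);
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += std::conj(a[i]) * b[i];
  }
  return sum;
}
```

For real scalars the conjugated and non-conjugated products coincide; they differ only for complex types.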
I implemented an explicit unrolling scheme, which appears to have very little impact on performance on my mainstream desktop machine. However, for small, fixed-size vectors in a tight loop, maybe it's worthwhile? I set an arbitrary cutoff of SizeAtCompileTime <= 32 for the unrolling, just to prevent giant unrolled loops.
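As a rough picture of what a compile-time unrolling cutoff looks like, here is a minimal sketch using template recursion. It is not the code in this MR: the names kUnrollLimit, DotUnroller, DotImpl and dot_fixed are hypothetical, it works on plain float arrays rather than Eigen expressions, and it only assumes the same limit of 32 described above.

```cpp
#include <cstddef>

// Hypothetical constant mirroring the SizeAtCompileTime <= 32 cutoff.
constexpr std::size_t kUnrollLimit = 32;

// Recursive step: expands at compile time into
// a[0]*b[0] + a[1]*b[1] + ... + a[N-1]*b[N-1].
template <std::size_t I, std::size_t N>
struct DotUnroller {
  static float run(const float* a, const float* b) {
    return a[I] * b[I] + DotUnroller<I + 1, N>::run(a, b);
  }
};

// Recursion terminator: no elements left.
template <std::size_t N>
struct DotUnroller<N, N> {
  static float run(const float*, const float*) { return 0.0f; }
};

// Sizes at or below the cutoff use the unrolled path.
template <std::size_t N, bool Unroll = (N <= kUnrollLimit)>
struct DotImpl {
  static float run(const float* a, const float* b) {
    return DotUnroller<0, N>::run(a, b);
  }
};

// Larger sizes fall back to an ordinary loop, so the unroller is never
// instantiated for them and code size stays bounded.
template <std::size_t N>
struct DotImpl<N, false> {
  static float run(const float* a, const float* b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
      sum += a[i] * b[i];
    }
    return sum;
  }
};

// Entry point: the array length is deduced at compile time.
template <std::size_t N>
float dot_fixed(const float (&a)[N], const float (&b)[N]) {
  return DotImpl<N>::run(a, b);
}
```

The selection happens entirely at compile time, so dynamic-size vectors would always take the loop path in a scheme like this.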
Additional information
Benchmark results for AVX2+FMA (a negative delta means the new code is faster):
name                       old cpu/op   new cpu/op   delta
BM_eigen_dot<float>/1      0.35ns ± 0%  0.70ns ± 0%  +101.84%  (p=0.000 n=47+51)
BM_eigen_dot<float>/8      1.21ns ± 4%  1.37ns ± 0%   +13.28%  (p=0.000 n=59+54)
BM_eigen_dot<float>/64     4.42ns ± 7%  3.89ns ± 2%   -11.92%  (p=0.000 n=57+47)
BM_eigen_dot<float>/512    33.7ns ± 4%  24.7ns ± 3%   -26.63%  (p=0.000 n=44+55)
BM_eigen_dot<float>/4k     306ns ± 3%   165ns ± 3%    -46.25%  (p=0.000 n=54+52)
BM_eigen_dot<float>/32k    2.67µs ± 4%  2.75µs ± 3%    +2.88%  (p=0.000 n=53+55)
BM_eigen_dot<float>/256k   80.6µs ± 2%  78.1µs ± 1%    -3.01%  (p=0.000 n=60+56)
BM_eigen_dot<float>/1M     322µs ± 1%   312µs ± 2%     -3.14%  (p=0.000 n=48+47)
BM_eigen_dot<double>/1     0.35ns ± 0%  0.70ns ± 0%  +102.06%  (p=0.000 n=46+58)
BM_eigen_dot<double>/8     1.94ns ± 3%  1.92ns ± 1%    -0.90%  (p=0.000 n=56+56)
BM_eigen_dot<double>/64    7.72ns ± 5%  6.27ns ± 3%   -18.80%  (p=0.000 n=51+57)
BM_eigen_dot<double>/512   70.8ns ± 4%  42.5ns ± 3%   -39.91%  (p=0.000 n=59+57)
BM_eigen_dot<double>/4k    671ns ± 3%   688ns ± 2%     +2.56%  (p=0.000 n=60+57)
BM_eigen_dot<double>/32k   5.35µs ± 4%  5.49µs ± 4%    +2.63%  (p=0.000 n=59+59)
BM_eigen_dot<double>/256k  161µs ± 2%   156µs ± 2%     -3.26%  (p=0.000 n=58+58)
BM_eigen_dot<double>/1M    667µs ±10%   655µs ±14%     -1.72%  (p=0.042 n=59+52)