Better dot products

Reference issue

What does this implement/fix?

Currently, inner products are implemented as a compound expression, i.e. a.conjugate().cwiseProduct(b).sum(). This works, but it precludes the use of fused multiply-add (FMA) instructions, which are ideal for inner products. This MR adds a new inner product evaluator that performs the reduction directly on the two operands.
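To illustrate the difference, here is a minimal scalar sketch in plain C++. It is not Eigen's evaluator code: the function names dot_mul_then_add and dot_fused are hypothetical, and the real evaluator works on SIMD packets rather than single floats. The first function mirrors the compound form (multiply, then add), the second folds each product into the accumulator with std::fma.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Compound form: each product is formed first and then added to the sum.
// (A compiler may still contract this into an FMA, depending on its flags.)
float dot_mul_then_add(const std::vector<float>& a, const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float prod = a[i] * b[i];
    sum += prod;
  }
  return sum;
}

// Direct reduction on both operands: one fused multiply-add per element.
float dot_fused(const std::vector<float>& a, const std::vector<float>& b) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    acc = std::fma(a[i], b[i], acc);
  }
  return acc;
}
```

Both functions assume a and b have the same length. For inputs whose products and sums are exactly representable the results agree; in general the fused version rounds once per element instead of twice.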

The new evaluator is used for the dot product (conjugated inner product) invoked by dot(), and the non-conjugated inner product invoked by a matrix product that results in a 1x1 scalar.

MatrixXf a = MatrixXf::Random(10, 10);  // initialized so the products below read defined values
float conjugated_inner_product = a.row(4).dot(a.col(4));
float non_conjugated_inner_product = a.row(4) * a.col(4);
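For real scalars such as float the two forms above give the same value; they only differ for complex scalars. The following standalone sketch uses std::complex instead of Eigen types to show the distinction. The names inner_conj and inner_plain are hypothetical, and the conjugation is applied to the first operand, matching the a.conjugate().cwiseProduct(b).sum() expression quoted earlier.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cf = std::complex<float>;

// Conjugated inner product: sum of conj(a[i]) * b[i].
cf inner_conj(const std::vector<cf>& a, const std::vector<cf>& b) {
  cf acc(0.0f, 0.0f);
  for (std::size_t i = 0; i < a.size(); ++i) {
    acc += std::conj(a[i]) * b[i];
  }
  return acc;
}

// Non-conjugated inner product: sum of a[i] * b[i], as in a 1x1 matrix product.
cf inner_plain(const std::vector<cf>& a, const std::vector<cf>& b) {
  cf acc(0.0f, 0.0f);
  for (std::size_t i = 0; i < a.size(); ++i) {
    acc += a[i] * b[i];
  }
  return acc;
}
```

With a = {1 + 2i} and b = {3 + 4i}, inner_conj gives 11 - 2i while inner_plain gives -5 + 10i.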

I implemented an explicit unrolling scheme, which appears to have very little impact on performance on my mainstream desktop machine. However, it may be worthwhile for small, fixed-size vectors in a tight loop. I set an arbitrary cutoff of SizeAtCompileTime <= 32 for the unrolling, just to prevent giant unrolled loops.
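The idea of unrolling with a size cutoff can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this MR: it uses std::array and scalar std::fma rather than Eigen expressions and packets, and the names kUnrollLimit, UnrolledDot, dot_impl and fixed_dot are hypothetical. Sizes up to the limit expand into a straight-line chain of FMAs via template recursion; larger sizes fall back to an ordinary loop.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <type_traits>

constexpr std::size_t kUnrollLimit = 32;  // mirrors the cutoff described above

// Recursive helper: accumulates elements [I, N), one fma per recursion step.
template <std::size_t I, std::size_t N>
struct UnrolledDot {
  static float run(const std::array<float, N>& a, const std::array<float, N>& b, float acc) {
    return UnrolledDot<I + 1, N>::run(a, b, std::fma(a[I], b[I], acc));
  }
};

// Base case: no elements left.
template <std::size_t N>
struct UnrolledDot<N, N> {
  static float run(const std::array<float, N>&, const std::array<float, N>&, float acc) {
    return acc;
  }
};

// Small sizes: fully unrolled at compile time.
template <std::size_t N>
float dot_impl(const std::array<float, N>& a, const std::array<float, N>& b, std::true_type) {
  return UnrolledDot<0, N>::run(a, b, 0.0f);
}

// Large sizes: plain loop, so the recursion is never instantiated for them.
template <std::size_t N>
float dot_impl(const std::array<float, N>& a, const std::array<float, N>& b, std::false_type) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < N; ++i) acc = std::fma(a[i], b[i], acc);
  return acc;
}

// Entry point: picks the unrolled or looped path from the compile-time size.
template <std::size_t N>
float fixed_dot(const std::array<float, N>& a, const std::array<float, N>& b) {
  return dot_impl(a, b, std::integral_constant<bool, (N <= kUnrollLimit)>{});
}
```

Tag dispatch is used here instead of a runtime branch so that the recursive template is only instantiated for sizes at or below the limit.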

Additional information

Benchmark results for AVX2+FMA (a negative delta means the new code is faster):

name                        old cpu/op   new cpu/op   delta
BM_eigen_dot<float>/1       0.35ns ± 0%  0.70ns ± 0%  +101.84%  (p=0.000 n=47+51)
BM_eigen_dot<float>/8       1.21ns ± 4%  1.37ns ± 0%   +13.28%  (p=0.000 n=59+54)
BM_eigen_dot<float>/64      4.42ns ± 7%  3.89ns ± 2%   -11.92%  (p=0.000 n=57+47)
BM_eigen_dot<float>/512     33.7ns ± 4%  24.7ns ± 3%   -26.63%  (p=0.000 n=44+55)
BM_eigen_dot<float>/4k       306ns ± 3%   165ns ± 3%   -46.25%  (p=0.000 n=54+52)
BM_eigen_dot<float>/32k     2.67µs ± 4%  2.75µs ± 3%    +2.88%  (p=0.000 n=53+55)
BM_eigen_dot<float>/256k    80.6µs ± 2%  78.1µs ± 1%    -3.01%  (p=0.000 n=60+56)
BM_eigen_dot<float>/1M       322µs ± 1%   312µs ± 2%    -3.14%  (p=0.000 n=48+47)
BM_eigen_dot<double>/1      0.35ns ± 0%  0.70ns ± 0%  +102.06%  (p=0.000 n=46+58)
BM_eigen_dot<double>/8      1.94ns ± 3%  1.92ns ± 1%    -0.90%  (p=0.000 n=56+56)
BM_eigen_dot<double>/64     7.72ns ± 5%  6.27ns ± 3%   -18.80%  (p=0.000 n=51+57)
BM_eigen_dot<double>/512    70.8ns ± 4%  42.5ns ± 3%   -39.91%  (p=0.000 n=59+57)
BM_eigen_dot<double>/4k      671ns ± 3%   688ns ± 2%    +2.56%  (p=0.000 n=60+57)
BM_eigen_dot<double>/32k    5.35µs ± 4%  5.49µs ± 4%    +2.63%  (p=0.000 n=59+59)
BM_eigen_dot<double>/256k    161µs ± 2%   156µs ± 2%    -3.26%  (p=0.000 n=58+58)
BM_eigen_dot<double>/1M      667µs ±10%   655µs ±14%    -1.72%  (p=0.042 n=59+52)
Edited by Rasmus Munk Larsen
