
Speed up complex * complex matrix multiplication.

This resolves a long-standing TODO: for complex * complex products, the block panel size in the m direction now depends on the number of available registers. The change speeds up complex * complex matrix multiplication by roughly 8-34%, depending on the scalar type and backend (measured for std::complex<float> and std::complex<double> with SSE4.2 and AVX2).
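To make the idea of a register-dependent panel size concrete, here is a minimal sketch. It is not the code from this change: `PanelRows`, its parameters, and the "at most 16 registers, half of them for accumulators" rule are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, NOT Eigen's actual code: picks the height of the
// result tile (rows, in the m direction) from the SIMD register budget.
//   num_registers: SIMD registers exposed by the target
//                  (x86-64 has 16 XMM/YMM registers for SSE/AVX2)
//   nr:            width of the result tile in columns
//   packet_size:   number of scalars held by one SIMD register
constexpr std::size_t PanelRows(std::size_t num_registers, std::size_t nr,
                                std::size_t packet_size) {
  // Assumed heuristic: consider at most 16 registers and give half of
  // them to accumulators, spread evenly over the nr columns.
  const std::size_t usable = std::min<std::size_t>(16, num_registers);
  // Always keep at least one packet per column so the tile is never empty.
  const std::size_t packets_per_column =
      std::max<std::size_t>(1, usable / 2 / nr);
  return packets_per_column * packet_size;
}
```

Under these assumptions, a target with 16 registers and nr = 4 gets two packets per column, so the tile is twice as tall as one fixed at a single packet; a target with only 8 registers falls back to one packet per column.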

Benchmark code:

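#include <complex>

#include <benchmark/benchmark.h>
#include <Eigen/Core>

// Times c += a * b for square n x n row-major matrices of scalar type T.
template <typename T>
void BM_MatMul(benchmark::State& state) {
  using Matrix =
      Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
  const Eigen::Index n = state.range(0);
  Matrix a(n, n), b(n, n), c(n, n);
  a.setRandom();
  b.setRandom();
  c.setZero();
  for (auto _ : state) {
    // noalias() lets Eigen accumulate into c without a temporary.
    c.noalias() += a * b;
  }
}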

Measurements:

SSE4.2:

name                                old cpu/op   new cpu/op   delta
BM_MatMul<std::complex<float>>/8     361ns ± 1%   294ns ± 0%  -18.47%  (p=0.000 n=48+46)
BM_MatMul<std::complex<float>>/32   13.4µs ± 2%  11.1µs ± 0%  -17.09%  (p=0.000 n=60+58)
BM_MatMul<std::complex<float>>/64   97.2µs ± 2%  83.0µs ± 0%  -14.60%  (p=0.000 n=49+57)
BM_MatMul<std::complex<float>>/128   735µs ± 2%   636µs ± 0%  -13.45%  (p=0.000 n=48+57)
BM_MatMul<std::complex<float>>/256  5.78ms ± 3%  5.02ms ± 0%  -13.09%  (p=0.000 n=48+58)
BM_MatMul<std::complex<float>>/512  46.5ms ± 3%  39.9ms ± 0%  -14.11%  (p=0.000 n=50+33)
BM_MatMul<std::complex<float>>/1k    371ms ± 1%   320ms ± 1%  -13.69%   (p=0.000 n=8+10)

BM_MatMul<std::complex<double>>/8     556ns ± 0%   507ns ± 0%  -8.67%  (p=0.000 n=53+55)
BM_MatMul<std::complex<double>>/32   23.5µs ± 1%  21.5µs ± 1%  -8.66%  (p=0.000 n=55+58)
BM_MatMul<std::complex<double>>/64    179µs ± 0%   162µs ± 1%  -9.92%  (p=0.000 n=57+57)
BM_MatMul<std::complex<double>>/128  1.39ms ± 1%  1.26ms ± 1%  -8.85%  (p=0.000 n=58+52)
BM_MatMul<std::complex<double>>/256  10.9ms ± 1%  10.1ms ± 1%  -7.69%  (p=0.000 n=52+52)
BM_MatMul<std::complex<double>>/512  89.0ms ± 2%  80.9ms ± 2%  -9.08%  (p=0.000 n=39+44)
BM_MatMul<std::complex<double>>/1k    708ms ± 1%   645ms ± 1%  -8.90%    (p=0.008 n=5+5)

AVX2:

name                                old cpu/op   new cpu/op   delta
BM_MatMul<std::complex<float>>/8     207ns ± 3%   181ns ± 3%  -12.70%  (p=0.000 n=60+59)
BM_MatMul<std::complex<float>>/32   6.67µs ± 3%  4.53µs ± 3%  -32.15%  (p=0.000 n=59+60)
BM_MatMul<std::complex<float>>/64   50.1µs ± 4%  33.7µs ± 3%  -32.74%  (p=0.000 n=60+49)
BM_MatMul<std::complex<float>>/128   387µs ± 3%   256µs ± 4%  -33.80%  (p=0.000 n=44+54)
BM_MatMul<std::complex<float>>/256  3.12ms ± 4%  2.08ms ± 3%  -33.40%  (p=0.000 n=46+53)
BM_MatMul<std::complex<float>>/512  24.7ms ± 3%  16.4ms ± 4%  -33.52%  (p=0.000 n=40+45)
BM_MatMul<std::complex<float>>/1k    198ms ± 4%   133ms ± 4%  -33.08%  (p=0.000 n=19+25)

BM_MatMul<std::complex<double>>/8     357ns ± 4%   283ns ± 4%  -20.51%  (p=0.000 n=49+55)
BM_MatMul<std::complex<double>>/32   13.3µs ± 4%   9.2µs ± 4%  -30.67%  (p=0.000 n=58+59)
BM_MatMul<std::complex<double>>/64    100µs ± 3%    68µs ± 3%  -32.60%  (p=0.000 n=56+60)
BM_MatMul<std::complex<double>>/128   768µs ± 3%   526µs ± 3%  -31.56%  (p=0.000 n=58+60)
BM_MatMul<std::complex<double>>/256  6.07ms ± 3%  4.25ms ± 4%  -30.00%  (p=0.000 n=58+51)
BM_MatMul<std::complex<double>>/512  49.2ms ± 6%  34.1ms ± 7%  -30.83%  (p=0.000 n=49+30)
BM_MatMul<std::complex<double>>/1k    400ms ± 2%   278ms ± 4%  -30.40%   (p=0.000 n=9+10)

Edited by Rasmus Munk Larsen
