Speed up complex * complex matrix multiplication.
This resolves an old TODO: the block panel size in the m direction now
depends on the number of registers for complex * complex products.
Complex * complex matrix multiplication becomes 8-33% faster, depending on
the scalar type and backend (measured for std::complex<float> and
std::complex<double> with SSE4.2 and AVX2).
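To illustrate the idea of sizing the panel from the register count, here is a minimal sketch. It is not the code from this change: the helper name PanelRows, its parameters, and the "spend half of at most 16 registers on accumulators" budget are assumptions made for the example.

```cpp
// Hypothetical helper (illustration only, not Eigen's actual code).
// Picks the number of rows in the m direction of the micro-kernel panel
// from a SIMD register budget.
//   num_registers: SIMD registers available on the target
//   nr:            columns handled per micro-kernel step
//   packet_size:   scalars held by one SIMD register
constexpr int PanelRows(int num_registers, int nr, int packet_size) {
  // Assumption: cap the budget at 16 registers and reserve half of it for
  // accumulators, leaving the rest for the loaded operands.
  return ((((num_registers < 16 ? num_registers : 16) / 2) / nr) > 1
              ? (((num_registers < 16 ? num_registers : 16) / 2) / nr)
              : 1) *  // always keep at least one packet of rows
         packet_size;
}

// With 16 registers, nr = 4 and two scalars per register (for example two
// std::complex<float> in a 128-bit register), the panel is two packets tall.
static_assert(PanelRows(16, 4, 2) == 4, "two packets of two rows");
// With only 8 registers, the panel falls back to a single packet.
static_assert(PanelRows(8, 4, 2) == 2, "one packet of two rows");
```

A fixed panel height of one packet leaves registers idle on targets that have more of them; scaling the height with the register count lets the kernel reuse each loaded operand across more accumulators.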
Benchmark code:
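#include <complex>
#include <benchmark/benchmark.h>
#include <Eigen/Core>

template <typename T>
void BM_MatMul(benchmark::State& state) {
  using Matrix =
      Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
  const int n = state.range(0);  // square matrix dimension
  Matrix a(n, n), b(n, n), c(n, n);
  a.setRandom();
  b.setRandom();
  c.setZero();
  for (auto s : state) {
    // Accumulate the product into c without an intermediate temporary.
    c.noalias() += a * b;
  }
}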
Measurements:
SSE4.2:
name                                   old cpu/op   new cpu/op    delta
BM_MatMul<std::complex<float>>/8       361ns ± 1%   294ns ± 0%  -18.47%  (p=0.000 n=48+46)
BM_MatMul<std::complex<float>>/32     13.4µs ± 2%  11.1µs ± 0%  -17.09%  (p=0.000 n=60+58)
BM_MatMul<std::complex<float>>/64     97.2µs ± 2%  83.0µs ± 0%  -14.60%  (p=0.000 n=49+57)
BM_MatMul<std::complex<float>>/128     735µs ± 2%   636µs ± 0%  -13.45%  (p=0.000 n=48+57)
BM_MatMul<std::complex<float>>/256    5.78ms ± 3%  5.02ms ± 0%  -13.09%  (p=0.000 n=48+58)
BM_MatMul<std::complex<float>>/512    46.5ms ± 3%  39.9ms ± 0%  -14.11%  (p=0.000 n=50+33)
BM_MatMul<std::complex<float>>/1k      371ms ± 1%   320ms ± 1%  -13.69%  (p=0.000 n=8+10)
BM_MatMul<std::complex<double>>/8      556ns ± 0%   507ns ± 0%   -8.67%  (p=0.000 n=53+55)
BM_MatMul<std::complex<double>>/32    23.5µs ± 1%  21.5µs ± 1%   -8.66%  (p=0.000 n=55+58)
BM_MatMul<std::complex<double>>/64     179µs ± 0%   162µs ± 1%   -9.92%  (p=0.000 n=57+57)
BM_MatMul<std::complex<double>>/128   1.39ms ± 1%  1.26ms ± 1%   -8.85%  (p=0.000 n=58+52)
BM_MatMul<std::complex<double>>/256   10.9ms ± 1%  10.1ms ± 1%   -7.69%  (p=0.000 n=52+52)
BM_MatMul<std::complex<double>>/512   89.0ms ± 2%  80.9ms ± 2%   -9.08%  (p=0.000 n=39+44)
BM_MatMul<std::complex<double>>/1k     708ms ± 1%   645ms ± 1%   -8.90%  (p=0.008 n=5+5)
AVX2:
name                                   old cpu/op   new cpu/op    delta
BM_MatMul<std::complex<float>>/8       207ns ± 3%   181ns ± 3%  -12.70%  (p=0.000 n=60+59)
BM_MatMul<std::complex<float>>/32     6.67µs ± 3%  4.53µs ± 3%  -32.15%  (p=0.000 n=59+60)
BM_MatMul<std::complex<float>>/64     50.1µs ± 4%  33.7µs ± 3%  -32.74%  (p=0.000 n=60+49)
BM_MatMul<std::complex<float>>/128     387µs ± 3%   256µs ± 4%  -33.80%  (p=0.000 n=44+54)
BM_MatMul<std::complex<float>>/256    3.12ms ± 4%  2.08ms ± 3%  -33.40%  (p=0.000 n=46+53)
BM_MatMul<std::complex<float>>/512    24.7ms ± 3%  16.4ms ± 4%  -33.52%  (p=0.000 n=40+45)
BM_MatMul<std::complex<float>>/1k      198ms ± 4%   133ms ± 4%  -33.08%  (p=0.000 n=19+25)
BM_MatMul<std::complex<double>>/8      357ns ± 4%   283ns ± 4%  -20.51%  (p=0.000 n=49+55)
BM_MatMul<std::complex<double>>/32    13.3µs ± 4%   9.2µs ± 4%  -30.67%  (p=0.000 n=58+59)
BM_MatMul<std::complex<double>>/64     100µs ± 3%    68µs ± 3%  -32.60%  (p=0.000 n=56+60)
BM_MatMul<std::complex<double>>/128    768µs ± 3%   526µs ± 3%  -31.56%  (p=0.000 n=58+60)
BM_MatMul<std::complex<double>>/256   6.07ms ± 3%  4.25ms ± 4%  -30.00%  (p=0.000 n=58+51)
BM_MatMul<std::complex<double>>/512   49.2ms ± 6%  34.1ms ± 7%  -30.83%  (p=0.000 n=49+30)
BM_MatMul<std::complex<double>>/1k     400ms ± 2%   278ms ± 4%  -30.40%  (p=0.000 n=9+10)
Edited by Rasmus Munk Larsen