Subpar matrix multiplication performance: time does not scale linearly with the number of columns of the left matrix.

Summary

The performance of matrix multiplication between Eigen matrices differs greatly depending on whether the code is compiled with or without the "-mfma" flag. In addition, for both column-major and row-major matrices, the multiplication time does not increase linearly with the number of columns of the left matrix.

Environment

Operating System : Linux
Architecture : x86_64
Eigen Version : 3.4.0
Compiler Version : g++ (GCC) 11.3.0
Vector Extension : AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

Minimal Example

Multiplying column-major matrices:

Number of columns	with "-mfma" (ms)	without "-mfma" (ms)
1	2.25866	2.3034
2	16.4197	6.11715
3	16.8995	7.50639
4	15.8654	6.277783

Multiplying row-major matrices:

Number of columns	with "-mfma" (ms)	without "-mfma" (ms)
1	2.23107	2.32525
2	17.6265	6.07003
3	18.4446	7.44619
4	17.5746	6.20737

The program used to generate these results is the following: profileGEMM.cpp.
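The attached program is not reproduced in this report. As a rough illustration of how per-multiplication timings like those in the tables might be taken, here is a minimal sketch. The helper name median_time_ms, the use of std::chrono::steady_clock, and the choice of a median over several repeats are all assumptions for illustration, not details taken from profileGEMM.cpp.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper (not from profileGEMM.cpp): runs `work` `repeats` times
// and returns the (upper) median wall-clock duration of one run, in milliseconds.
template <typename Work>
double median_time_ms(Work&& work, int repeats = 10) {
    std::vector<double> samples;
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();  // the operation being measured
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    if (samples.empty()) return 0.0;  // nothing was measured
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

With Eigen, `work` could be a lambda that evaluates a product such as `C.noalias() = A * B;` for preallocated matrices, called once per column count under test.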

This is the script used to compile the above cpp program:

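#!/bin/bash
# Path to the Eigen 3.4.0 headers (Eigen is header-only, so an include path suffices).
EIGEN_DIR=eigen-3.4.0/

# Build with FMA enabled.
g++ -I"$EIGEN_DIR" profileGEMM.cpp -o bench -mavx2 -mfma -O3
# Build without FMA.
g++ -I"$EIGEN_DIR" profileGEMM.cpp -o bench_nofma -mavx2 -O3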

Steps to reproduce

Run the compile script.
Run the FMA build with ./bench.
Run the non-FMA build with ./bench_nofma.

What is the current bug behavior?

The time taken for matrix multiplication does not increase linearly with the number of columns of the left matrix. Matrix multiplication is also slower when compiled with the "-mfma" flag than without it: the times are about equal for 1 column, but for 2 to 4 columns the "-mfma" build is roughly 2 to 3 times slower.

What is the expected correct behavior?

The time taken for matrix multiplication should increase linearly with the number of columns of the left matrix. The time when compiling with the "-mfma" flag should be comparable to the time obtained when compiling without the flag.
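To make the linearity expectation concrete, the sketch below compares each measured time against a simple proportional model. It assumes, purely for illustration, that "linear" means proportional to the column count with the 1-column measurement as the baseline; the function name slowdown_vs_linear is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: times_ms[i] is the measured time for (i + 1) columns.
// Returns, for each entry, measured time divided by the time predicted by a
// proportional model based on the 1-column measurement (assumed baseline).
std::vector<double> slowdown_vs_linear(const std::vector<double>& times_ms) {
    std::vector<double> ratios;
    if (times_ms.empty() || times_ms.front() <= 0.0) return ratios;
    const double per_column_ms = times_ms.front();
    for (std::size_t i = 0; i < times_ms.size(); ++i) {
        const double expected_ms = per_column_ms * static_cast<double>(i + 1);
        ratios.push_back(times_ms[i] / expected_ms);  // 1.0 means "on the line"
    }
    return ratios;
}
```

Fed the column-major "-mfma" timings from the table (2.25866, 16.4197, 16.8995, 15.8654), this gives ratios of about 1.0, 3.6, 2.5 and 1.8, so the 2-column case takes roughly 3.6 times what the proportional model predicts.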

Relevant logs

The program profileGEMM.cpp also produces the following logs:

EIGEN_VECTORIZE_FMA
EIGEN_HAS_SINGLE_INSTRUCTION_MADD
AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Version: 3.4.0
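As a cross-check on which build is being run, the compiler's own predefined macros can be inspected independently of Eigen. GCC defines __AVX__, __AVX2__ and __FMA__ when the corresponding -m flags are enabled. The sketch below looks only at those compiler macros, not at Eigen's EIGEN_* macros, and the function name compiler_simd_flags is hypothetical.

```cpp
#include <string>

// Hypothetical helper: reports which of AVX, AVX2 and FMA the compiler was
// told to target, based solely on its predefined macros.
std::string compiler_simd_flags() {
    std::string flags;
#ifdef __AVX__
    flags += "AVX ";
#endif
#ifdef __AVX2__
    flags += "AVX2 ";
#endif
#ifdef __FMA__
    flags += "FMA ";
#endif
    return flags.empty() ? "none" : flags;
}
```

Printing this string from both the bench and bench_nofma builds should show "FMA" only for the binary compiled with "-mfma".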

Edited Apr 11, 2023 by Matias Lin