Subpar matrix multiplication performance: run time is not linear in the number of columns of the left matrix.
Summary
The performance of matrix multiplication between Eigen::Matrix objects differs greatly depending on whether the code is compiled with the "-mfma" flag. In addition, for both column-major and row-major matrices, the multiplication time does not increase linearly with the number of columns of the left matrix.
Environment
- Operating System : Linux
- Architecture : x86_64
- Eigen Version : 3.4.0
- Compiler Version : g++ (GCC) 11.3.0
- Vector Extension : AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Minimal Example
Multiplying column major matrices:
| Columns of left matrix | Time with "-mfma" (ms) | Time without "-mfma" (ms) |
|---|---|---|
| 1 | 2.25866 | 2.3034 |
| 2 | 16.4197 | 6.11715 |
| 3 | 16.8995 | 7.50639 |
| 4 | 15.8654 | 6.277783 |
Multiplying row major matrices:
| Columns of left matrix | Time with "-mfma" (ms) | Time without "-mfma" (ms) |
|---|---|---|
| 1 | 2.23107 | 2.32525 |
| 2 | 17.6265 | 6.07003 |
| 3 | 18.4446 | 7.44619 |
| 4 | 17.5746 | 6.20737 |
These results were generated with the following program: profileGEMM.cpp.
The program was compiled with this script:
#!/bin/bash
export LD_LIBRARY_PATH=eigen-3.4.0/
g++ -I$LD_LIBRARY_PATH profileGEMM.cpp -o bench -mavx2 -mfma -O3
g++ -I$LD_LIBRARY_PATH profileGEMM.cpp -o bench_nofma -mavx2 -O3
Steps to reproduce
- Execute the compile script.
- Run the executable built with "-mfma": ./bench
- Run the executable built without "-mfma": ./bench_nofma
What is the current bug behavior?
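The time taken for matrix multiplication does not increase linearly with the number of columns of the left matrix. Going from 1 to 2 columns multiplies the time by roughly 7 to 8 with "-mfma" and by about 2.6 without it, while the times for 2, 3, and 4 columns are close to each other. In addition, for 2 to 4 columns the multiplication compiled with "-mfma" takes roughly 2.3 to 2.9 times as long as the one compiled without the flag.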
What is the expected correct behavior?
The time taken for matrix multiplication should increase linearly with the number of columns of the left matrix. The time obtained when compiling with the "-mfma" flag should be comparable to, or better than, the time obtained when compiling without it.
Relevant logs
The program profileGEMM.cpp also produces the following logs:
EIGEN_VECTORIZE_FMA
EIGEN_HAS_SINGLE_INSTRUCTION_MADD
AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Version: 3.4.0