Slow fp16 matrix multiplication performance on CPU with avx512_fp16 support
Summary
Hello, I saw that the master branch added AVX512FP16 support last year, so I ran some tests, but found some strange performance results.
fp16 (type _Float16) matrix multiplication is quite slow on a CPU with avx512_fp16 support (tested on an Intel(R) Xeon(R) Platinum 8480+).
May I ask whether this is expected? Thanks in advance!
Environment
- Operating System : Linux
- Architecture : x64
- Eigen Version : master
- Compiler Version : GCC 13.2.0
- Compile Flags : -O3 -march=native -mavx512f -mavx512fp16 -mavx512vl -ffast-math
- Vector Extension : AVX512F, AVX512FP16
Minimal Example
template<typename MatA, typename MatB>
MatA calcMat(const MatA& p_MatA, const MatB& p_MatB)
{
return p_MatA * p_MatB;
}
Tested with the following type combinations:
float:
MatA = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>
MatB = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>
_Float16:
MatA = Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>
MatB = Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>
half:
MatA = Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>
MatB = Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>
Steps to reproduce
- Build the benchmark below with the compile flags listed above
- Run it, which calls testRun(1000000, 500)
What is the current bug behavior?
When MatA is 1000000x768 and MatB is 768x500:
_Float16 matrix multiplication takes more than 10 times as long as float (87.5242 s vs 6.75557 s)
Eigen::half takes about 2 times as long as float (13.1653 s vs 6.75557 s)
What is the expected correct behavior?
_Float16 computation should take roughly half the time of float, since an AVX512FP16 instruction normally processes twice as many elements per operation as the corresponding AVX512F instruction.
Benchmark scripts and results
#include <Eigen/Dense>
#include <chrono>
#include <cstdlib>
#include <iostream>
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXf;
typedef Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXf16;
typedef Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXHalf;
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXfCol;
typedef Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXf16Col;
typedef Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXHalfCol;
template<typename MatA, typename MatB>
MatA calcMat(const MatA& p_MatA, const MatB& p_MatB)
{
return p_MatA * p_MatB;
}
float getRandomFloat(const float amin, const float amax) noexcept
{
return amin + (static_cast<float>(std::rand()) / RAND_MAX) * (amax - amin);
}
void testRun(size_t MatARowNum, size_t MatBRowNum)
{
std::srand(0);
size_t colNum = 768;
MatrixXf MatA(MatARowNum, colNum);
MatrixXf16 MatA16(MatARowNum, colNum);
MatrixXHalf MatAHalf(MatARowNum, colNum);
MatrixXfCol MatBColMajor(colNum, MatBRowNum);
MatrixXf16Col MatB16ColMajor(colNum, MatBRowNum);
MatrixXHalfCol MatBHalfColMajor(colNum, MatBRowNum);
for(size_t row = 0; row < MatARowNum; row++)
{
for(size_t col = 0; col < colNum; col++)
{
float val = getRandomFloat(0.0, 1.0);
MatA(row, col) = val;
MatA16(row, col) = _Float16(val);
MatAHalf(row, col) = Eigen::half(val);
}
}
for(size_t row = 0; row < MatBRowNum; row++)
{
for(size_t col = 0; col < colNum; col++)
{
float val = getRandomFloat(0.0, 1.0);
MatBColMajor(col, row) = val;
MatB16ColMajor(col, row) = _Float16(val);
MatBHalfColMajor(col, row) = Eigen::half(val);
}
}
{
auto start = std::chrono::system_clock::now();
auto res = calcMat<MatrixXf, MatrixXfCol>(MatA, MatBColMajor);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "eigen float32 time used: " << elapsed_seconds.count() << std::endl;
}
{
auto start = std::chrono::system_clock::now();
auto res = calcMat<MatrixXf16, MatrixXf16Col>(MatA16, MatB16ColMajor);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "eigen fp16 time used: " << elapsed_seconds.count() << std::endl;
}
{
auto start = std::chrono::system_clock::now();
auto res = calcMat<MatrixXHalf, MatrixXHalfCol>(MatAHalf, MatBHalfColMajor);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "eigen Half time used: " << elapsed_seconds.count() << std::endl;
}
}
Then call testRun(1000000, 500)
Anything else that might help
I've also tested some other matrix operations, such as:
template<typename MatA, typename MatB>
MatA calcVec(const MatA& p_MatA, const MatB& p_MatB)
{
return (p_MatA.rowwise() - p_MatB).rowwise().squaredNorm().transpose();
}
Given MatA of size 10000000x768 and MatB of size 1x768, both RowMajor:
this time float costs 2.36102 s, _Float16 1.24202 s, and Eigen::half 54.0216 s. So here _Float16 is about twice as fast as float, as expected, while Eigen::half is over 20 times slower.
Is there a plan to fix this issue?