Slow fp16 matrix multiplication performance on CPU with avx512_fp16 support

Summary

Hello, I saw that the master branch added AVX512FP16 support last year, so I ran some tests, but found some strange performance results.

fp16 (type _Float16) matrix multiplication is quite slow on a CPU with avx512_fp16 support (tested on an Intel(R) Xeon(R) Platinum 8480+).

Is this expected? Thanks in advance!

Environment

  • Operating System : Linux
  • Architecture : x64
  • Eigen Version : master
  • Compiler Version : GCC 13.2.0
  • Compile Flags : -O3 -march=native -mavx512f -mavx512fp16 -mavx512vl -ffast-math
  • Vector Extension : AVX512F, AVX512FP16
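
For reference, a quick compile-time sanity check that the fp16 path is actually enabled (a minimal sketch; __AVX512FP16__ is the macro GCC predefines when -mavx512fp16 is in effect):

// Fails the build if AVX512FP16 code generation is not enabled.
#ifndef __AVX512FP16__
#error "AVX512FP16 is not enabled; check -march / -mavx512fp16"
#endif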

Minimal Example

template<typename MatA, typename MatB>
MatA calcMat(const MatA& p_MatA, const MatB& p_MatB)
{
    return p_MatA * p_MatB;
}

Tested with the following type combinations:

float:

MatA = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>

MatB = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>

_Float16:

MatA = Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>

MatB = Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>

half:

MatA = Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>

MatB = Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>

Steps to reproduce

  1. Build the benchmark below with GCC 13.2.0 and the compile flags listed above.
  2. Run the binary, which calls testRun(1000000, 500).
  3. Compare the printed timings for float, _Float16, and Eigen::half.

What is the current bug behavior?

When MatA is 1000000x768 and MatB is 768x500:

_Float16 matrix multiplication takes more than 10x as long as float (87.5242 s vs 6.75557 s).

Eigen::half takes about 2x as long as float (13.1653 s vs 6.75557 s).

What is the expected correct behavior?

_Float16 computation should ideally take about half the time of float: a 512-bit register holds twice as many fp16 lanes, so each AVX512FP16 instruction does the work of two corresponding AVX512F instructions.
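
To illustrate the expectation (a minimal sketch with raw intrinsics, not Eigen's actual kernels): one AVX512FP16 FMA covers 32 half-precision lanes, while the AVX512F FMA covers 16 single-precision lanes.

#include <immintrin.h>

// One 512-bit FMA each; the fp16 variant processes twice as many elements.
__m512  fma_f32(__m512 a, __m512 b, __m512 c)    { return _mm512_fmadd_ps(a, b, c); } // 16 float lanes
__m512h fma_f16(__m512h a, __m512h b, __m512h c) { return _mm512_fmadd_ph(a, b, c); } // 32 _Float16 lanes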

Benchmark scripts and results

#include <Eigen/Dense>
#include <chrono>
#include <cstdlib>
#include <iostream>

typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXf;
typedef Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXf16;
typedef Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXHalf;

typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXfCol;
typedef Eigen::Matrix<_Float16, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXf16Col;
typedef Eigen::Matrix<Eigen::half, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXHalfCol;

template<typename MatA, typename MatB>
MatA calcMat(const MatA& p_MatA, const MatB& p_MatB)
{
    return p_MatA * p_MatB;
}

float getRandomFloat(const float amin, const float amax) noexcept
{
    return amin + (static_cast<float>(std::rand()) / RAND_MAX) * (amax - amin);
}

// Fills identically-valued operands for all three scalar types, then times A * B for each.
// Note: MatBRowNum is actually the column count of MatB (MatB is colNum x MatBRowNum).
void testRun(size_t MatARowNum, size_t MatBRowNum)
{
    std::srand(0);
    size_t colNum = 768;
    MatrixXf MatA(MatARowNum, colNum);
    MatrixXf16 MatA16(MatARowNum, colNum);
    MatrixXHalf MatAHalf(MatARowNum, colNum);

    MatrixXfCol MatBColMajor(colNum, MatBRowNum);
    MatrixXf16Col MatB16ColMajor(colNum, MatBRowNum);
    MatrixXHalfCol MatBHalfColMajor(colNum, MatBRowNum);

    for(size_t row = 0; row < MatARowNum; row++)
    {
        for(size_t col = 0; col < colNum; col++)
        {
            float val = getRandomFloat(0.0, 1.0);
            MatA(row, col) = val;
            MatA16(row, col) = _Float16(val);
            MatAHalf(row, col) = Eigen::half(val);
        }
    }

    for(size_t row = 0; row < MatBRowNum; row++)
    {
        for(size_t col = 0; col < colNum; col++)
        {
            float val = getRandomFloat(0.0, 1.0);
            MatBColMajor(col, row) = val;
            MatB16ColMajor(col, row) = _Float16(val);
            MatBHalfColMajor(col, row) = Eigen::half(val);
        }
    }

    {
        auto start = std::chrono::system_clock::now();
        auto res = calcMat<MatrixXf, MatrixXfCol>(MatA, MatBColMajor);
        auto end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;
        std::cout << "eigen float32 time used: " << elapsed_seconds.count() << std::endl;
    }

    {
        auto start = std::chrono::system_clock::now();
        auto res = calcMat<MatrixXf16, MatrixXf16Col>(MatA16, MatB16ColMajor);
        auto end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;
        std::cout << "eigen fp16 time used: " << elapsed_seconds.count() << std::endl;
    }

    {
        auto start = std::chrono::system_clock::now();
        auto res = calcMat<MatrixXHalf, MatrixXHalfCol>(MatAHalf, MatBHalfColMajor);
        auto end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;
        std::cout << "eigen Half time used: " << elapsed_seconds.count() << std::endl;
    }
}

Then call testRun(1000000, 500).
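
The driver is just (a trivial sketch of the entry point):

int main()
{
    testRun(1000000, 500);
    return 0;
}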

Anything else that might help

I've also tested some other matrix operations, for example computing per-row squared distances to a row vector:

template<typename MatA, typename MatB>
MatA calcVec(const MatA& p_MatA, const MatB& p_MatB)
{
    return (p_MatA.rowwise() - p_MatB).rowwise().squaredNorm().transpose();
}

With MatA 10000000x768 and MatB 1x768, both RowMajor:

this time float takes 2.36102 s, _Float16 takes 1.24202 s, and Eigen::half takes 54.0216 s.
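
For reference, the timing for that case can be reconstructed following the same pattern as testRun above (a sketch, not the exact harness; the names RowVecF, A, and b are illustrative, and Random fills with values in [-1, 1] rather than [0, 1]):

typedef Eigen::Matrix<float, 1, Eigen::Dynamic, Eigen::RowMajor> RowVecF;

// Same chrono pattern as in testRun; repeat analogously for _Float16 and Eigen::half.
MatrixXf A = MatrixXf::Random(10000000, 768);
RowVecF b = RowVecF::Random(768);

auto start = std::chrono::system_clock::now();
auto res = calcVec<MatrixXf, RowVecF>(A, b);
auto end = std::chrono::system_clock::now();
std::cout << "float time used: " << std::chrono::duration<double>(end - start).count() << std::endl;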
