Generalize parallel GEMM implementation in Core to work with ThreadPool in addition to OpenMP.
This generalizes the implementation of parallel dense matrix multiplication in Eigen Core to work with Eigen::ThreadPool, in addition to OpenMP.
Example code:
#define EIGEN_GEMM_THREADPOOL
#include <Eigen/Core>
int n = 4096;
int num_threads = 8;
Eigen::ThreadPool pool(num_threads);
Eigen::setGemmThreadPool(&pool);
Eigen::MatrixXf u, v, x;
v.setOnes(n, n); u.setOnes(n, n); x.setOnes(n, n);
x.noalias() = v * u;
Initial measurements are in $3618686
Eventually, we want to tie this into the device framework in !1395 (merged), such that you could achieve the same effect with
ThreadPool pool(num_threads);
SimpleThreadPoolDevice device(pool);
x.device(device).noalias() = u * v;
Just to make it clear: the purpose of this MR is not to improve the parallel GEMM implementation in Core, which is still inferior to the parallel tensor contraction. The purpose is to make it available on platforms without OpenMP.

Below is a strong scaling plot for n=m=k=4096, measured on my Lenovo P920 workstation (2 sockets x 18 physical cores x 2 threads, Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz) running Linux.