Generalize parallel GEMM implementation in Core to work with ThreadPool in addition to OpenMP.
This generalizes the implementation of parallel dense matrix multiplication in Eigen Core to work with Eigen::ThreadPool, in addition to OpenMP.
Example code:
#define EIGEN_GEMM_THREADPOOL
#include <Eigen/Core>
int n = 4096;
int num_threads = 8;
Eigen::ThreadPool pool(num_threads);
Eigen::setGemmThreadPool(&pool);
Eigen::MatrixXf u, v, x;
v.setOnes(n, n); u.setOnes(n, n); x.setOnes(n, n);
x.noalias() = v * u;
Initial measurements are in $3618686
Eventually, we want to tie this into the device framework in !1395 (merged), such that you could achieve the same effect with
ThreadPool pool(num_threads);
SimpleThreadPoolDevice device(pool);
x.device(device).noalias() = u * v;
Just to make it clear: the purpose of this MR is not to improve the parallel GEMM implementation in Core, which is still inferior to the parallel tensor contraction. The purpose is to make it available on platforms without OpenMP.

Below is a strong scaling plot for n=m=k=4096, measured on my Lenovo P920 workstation (2 sockets x 18 physical cores x 2 threads, Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz) running Linux.