Parallel assignment
Submitted by Jonas Adler
Assigned to Nobody
Link to original bugzilla bug (#1044)
Version: 3.4 (development)
Description
Many users (myself included) use Eigen because of its simplicity in writing fast and clear code. One part where i find it lacking is in support for parallel assignment.
In a recent project I had a line of code similar to this using most of the run-time:
A.row(i) += A.row(j)
The point being that in the end, my Eigen code was as fast as manually writing:
for (int k=0; k<n; k++)
A(i,k) += A(j,k)
But this was on a 8 core machine. If i wanted to improve the performance of the code by parallelizing this loop, I'd be forced to do so myself. By doing this myself I'd need to sacrifice both simplicity and clarity, being forced to do something like:
#pragma omp parallel for
for (int k=0; k<n; k++)
A(i,k) += A(j,k)
My suggestion here (which I would be willing to implement myself if accepted) is to add the following syntax:
A.row(i).parallel() += A.row(j)
This could be implemented similarly to how noalias() is currently implemented. At a lower level this could be written by having an outer loop over blocks inside assign_impl, possibly by adding another template parameter 'Parallelization' with default value 'NoParallelization'. An example in the one dimensional case would be:
template<typename Derived1, typename Derived2, int Unrolling>
struct internal::assign_impl<Derived1, Derived2, LinearVectorization, Unrolling, OMPParallelization>
{
static void run(Derived1 &dst, const Derived2 &src)
{
const int size = dst.size();
const int n_threads = nbThreads();
#pragma omp parallel
{
const int id = omp_get_thread_num();
const Index istart = id*size/n_threads;
Index iend = (id+1)*size/n_threads;
if(id == n_threads-1) iend = size;
internal::assign_implEigen::VectorBlock<Derived1>,Eigen::VectorBlock<Derived2>,LinearVectorization,Unrolling,NoParallelization::run(dst.segment(istart,iend),src.segment(istart,iend));
}
}
};
In the long run, this could also be added to the assign_selector, automatically picking a parallel assignment if deemed correct (for example if the size of the array is large enough), but this can be put of for now.
Is this a decent suggestion that I should try implementing on a fork, or do you think it is to complicated or out of Eigen's scope?
If it is interesting, is my implementation suggestion reasonable, or is there some better way to do it?