Use one-to-all and all-to-one MPI collectives instead of unnecessary MPI_Allreduce calls
The following discussion from !24 (merged) should be addressed:
- @francis.casson started a discussion (+2 comments):
Someone was way too fond of MPI_Allreduce... It is a slow MPI operation because it is effectively all-to-all: every process contributes data and every process receives the combined result.
What is wrong with using MPI_Bcast for the inputs? It is much more efficient, since only one process (the root) sends.
Likewise, for the outputs you should probably be using MPI_Gather. Again, it is much more efficient than an allreduce, since each process only sends to a single process (the root).
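To make the message-count argument above concrete, here is a rough sketch that counts point-to-point messages and array elements moved under deliberately naive, linear implementations of each collective. It is a toy model, not how MPI libraries work: real implementations (e.g. MPICH, Open MPI) use tree- or ring-based algorithms, so every count here overstates actual cost. The names `Cost`, `linear_bcast`, `linear_gather` and `all_to_all_allreduce`, and the parameters `p` (number of ranks), `n` (input length) and `m` (output block per rank) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    messages: int  # number of point-to-point messages sent
    elements: int  # total array elements moved across all messages


def linear_bcast(p: int, n: int) -> Cost:
    """Naive broadcast: the root sends its n-element input to each of the other p-1 ranks."""
    return Cost(p - 1, (p - 1) * n)


def linear_gather(p: int, m: int) -> Cost:
    """Naive gather: each of the p-1 non-root ranks sends its m-element output block to the root."""
    return Cost(p - 1, (p - 1) * m)


def all_to_all_allreduce(p: int, count: int) -> Cost:
    """Naive 'all-to-all' allreduce: every rank sends its count-element buffer to every other rank."""
    return Cost(p * (p - 1), p * (p - 1) * count)


if __name__ == "__main__":
    p, n, m = 64, 1000, 1000
    print("inputs : bcast", linear_bcast(p, n), "vs allreduce", all_to_all_allreduce(p, n))
    # Collecting outputs with an allreduce forces every rank's buffer to hold
    # all p*m elements, since allreduce needs the same count on every rank.
    print("outputs: gather", linear_gather(p, m), "vs allreduce", all_to_all_allreduce(p, p * m))
```

In this model the broadcast and gather both grow linearly in `p`, while the all-to-all allreduce grows quadratically, and for outputs it also has to carry the full `p * m` result buffer on every rank. If the per-rank output blocks differ in size, MPI_Gatherv is the variable-count counterpart of MPI_Gather.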