Use one-to-all and all-to-one MPI collectives instead of unnecessary MPI_Allreduce

The following discussion from !24 (merged) should be addressed:

@francis.casson started a discussion: (+2 comments)

Someone was way too fond of MPI_Allreduce... It is a comparatively slow MPI operation, since every process both contributes data and receives the result.

What is wrong with using MPI_Bcast for inputs? It is much more efficient, since only the root process sends.

Likewise, for the output, you should probably be using MPI_Gather. Again, this is far more efficient than an allreduce, since each process only sends to a single root process.
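
To put rough numbers on the point above, here is a toy message-count model. It is a sketch under stated assumptions, not a measurement of any MPI library: the function names are hypothetical, the allreduce is assumed to use recursive doubling on a power-of-two number of ranks, and only the number of point-to-point messages is counted (not message sizes or latency). Real implementations pick their algorithms internally, so the actual gain should be checked by profiling.

```python
def bcast_messages(p: int) -> int:
    """Messages needed for one root to reach the other p - 1 ranks.

    Every non-root rank receives the data exactly once, so any tree
    shape (flat or binomial) costs p - 1 messages in total.
    """
    return p - 1


def gather_messages(p: int) -> int:
    """Messages needed for p - 1 ranks to deliver their data to one root."""
    return p - 1


def allreduce_messages(p: int) -> int:
    """Messages for a recursive-doubling allreduce (assumed algorithm).

    In each of log2(p) rounds every rank sends one message to a partner,
    giving p * log2(p) messages. The model only covers powers of two.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("this toy model assumes a power-of-two rank count")
    return p * (p.bit_length() - 1)


if __name__ == "__main__":
    print("ranks  bcast  gather  allreduce")
    for p in (2, 8, 64, 1024):
        print(p, bcast_messages(p), gather_messages(p), allreduce_messages(p))
```

Under these assumptions the rooted collectives grow linearly with the number of ranks, while the allreduce picks up an extra log2(p) factor. Note also that MPI_Gather collects values without combining them; if the output actually needs to be summed onto one rank, MPI_Reduce is the rooted counterpart of MPI_Allreduce.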
