Use CUDA-aware MPI when available

This can massively speed up the Laplacian kernel for domain parallel runs because the transfer of the boundary points can happen directly from device to device and can overlap with the processing of the inner points.