GPU direct communication with CUDA-aware lib-MPI
Implement CUDA-aware MPI support so that GPU direct communication can run across multiple nodes.
Earlier related conversation on the parent issue prior to splitting this out: #2915 (comment 519222282)
Edited by Szilárd Páll