Description

Use all GPUs on a node in the CUDA backend by assigning devices to the MPI ranks in round-robin fashion, similar to the OpenCL backend. Previously, all ranks on a node used only the first device.
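
For illustration, a minimal sketch of round-robin device selection, not the actual code in this change. The helper name `select_device` is hypothetical, and it assumes the node-local rank and the device count have already been queried (for example via `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED` and `cudaGetDeviceCount`).

```cpp
#include <cassert>

// Hypothetical helper (not taken from this PR): map a node-local MPI rank
// to a CUDA device index in round-robin order.
int select_device(int local_rank, int device_count) {
    // Assumption: the caller passes a valid local rank and at least one device.
    assert(local_rank >= 0 && device_count > 0);
    // Ranks 0..device_count-1 each get their own device; further ranks wrap around.
    return local_rank % device_count;
}

// In an MPI + CUDA program the inputs could be obtained roughly like this
// (sketch only, error handling omitted):
//   MPI_Comm node_comm;
//   MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
//                       MPI_INFO_NULL, &node_comm);
//   int local_rank;   MPI_Comm_rank(node_comm, &local_rank);
//   int device_count; cudaGetDeviceCount(&device_count);
//   cudaSetDevice(select_device(local_rank, device_count));
```

With 4 GPUs and 8 ranks per node, ranks 0 to 3 would map to devices 0 to 3 and ranks 4 to 7 would wrap around to devices 0 to 3 again, so each device is shared by two ranks instead of all eight ranks sharing device 0.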