The gist of it is that, by default, each process gets its own CUDA context and the driver time-slices between them, serialising concurrent GPU usage.
The consequence is that kernels launched from different processes will not run concurrently on the same GPU.
In certain cases, this can have a significant impact on performance.
The following script is extracted from Nvidia's official documentation.
First, the CUDA_VISIBLE_DEVICES environment variable is set so that only CUDA devices 0 and 1 are visible.
nvidia-smi is then used to set the compute mode of these two devices to EXCLUSIVE_PROCESS, so that each of them can only be used by a single process at a time (that process will be the MPS server, which multiplexes the client processes onto the GPU).
Then, nvidia-cuda-mps-control is run in daemon mode and pinned to CPU core 0 using taskset.
Only now is the application run.
Finally, nvidia-cuda-mps-control is shut down and the GPUs are restored to the DEFAULT compute mode.
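
For reference, here is a minimal sketch of that sequence, not the verbatim script from Nvidia's documentation; the device IDs 0 and 1 match the description above, while the application binary ./my_app is a placeholder:

```bash
#!/bin/bash
# Sketch of the MPS setup described above. Assumes two GPUs with IDs 0 and 1
# and a placeholder application binary ./my_app.

# Restrict the job to GPUs 0 and 1.
export CUDA_VISIBLE_DEVICES=0,1

# Put both GPUs into EXCLUSIVE_PROCESS compute mode (requires root).
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-smi -i 1 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon, pinned to CPU core 0.
taskset -c 0 nvidia-cuda-mps-control -d

# Run the application; its CUDA work now goes through the MPS server.
./my_app

# Shut the MPS control daemon down and restore the default compute mode.
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 0 -c DEFAULT
nvidia-smi -i 1 -c DEFAULT
```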