Limit the number of default tMPI ranks sharing a GPU
With the large number of cores on modern HPC machines, the default thread-MPI launch can place 8 or more ranks on each GPU, which is most likely suboptimal. Therefore, when the tMPI rank count is determined automatically, we limit the maximum number of ranks per GPU; this limit is currently set to four.
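A minimal C++ sketch of the capping rule, assuming the automatic rank count has already been derived from the core count. The names `limitDefaultRanksPerGpu` and `c_maxDefaultRanksPerGpu` are hypothetical and do not reflect the actual GROMACS implementation.

```cpp
#include <algorithm>

// Hypothetical constant: the commit sets the default cap to four ranks per GPU.
constexpr int c_maxDefaultRanksPerGpu = 4;

// Hypothetical helper: cap an automatically chosen thread-MPI rank count so
// that no GPU is shared by more than c_maxDefaultRanksPerGpu ranks.
// Only applies to the automatic choice; an explicit user rank count is not capped.
int limitDefaultRanksPerGpu(int autoRankCount, int numGpus)
{
    if (numGpus <= 0)
    {
        // CPU-only run: there is no GPU sharing to limit.
        return autoRankCount;
    }
    return std::min(autoRankCount, numGpus * c_maxDefaultRanksPerGpu);
}
```

For example, if the automatic choice on a node with 2 GPUs would be 64 ranks, the cap reduces it to 8 ranks (4 per GPU), while an automatic choice of 6 ranks is left unchanged.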
Partially addresses #4332