Question: how can an MPI run on multiple GPUs be made faster than a run on a single GPU?
- QE version: git commit 9da20ff7
- Epyc Rome 7742 + 4 * V100 + NVHPC 21.5
- ucx-1.11.0-nvhpc compiled with gdr_copy and CUDA support.
- Run command:

```shell
LD_LIBRARY_PATH=/opt/nonspack/ucx-1.11.0-nvhpc/lib/:/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/nvhpc-21.5-djbdrus6olz7nhhefxymbojz6lp2d23h/Linux_x86_64/21.5/compilers/lib/:/opt/nonspack/ucx-1.11.0-nvhpc/lib/:$LD_LIBRARY_PATH:/opt/spack/opt/spack/linux-debian10-zen2/gcc-10.2.0/nvhpc-21.5-djbdrus6olz7nhhefxymbojz6lp2d23h/Linux_x86_64/21.5/compilers/lib/ \
mpirun --allow-run-as-root -mca pml ucx \
  -x UCX_TLS=gdr_copy,cuda,dc_mlx5 \
  -np 4 -x LD_LIBRARY_PATH \
  /root/yyw/q-e/sb-mpi/bin/pw.x -i ./ausurf.in
```
Hi, I've noticed that the multi-GPU runs show very low GPU utilization, even with GPU-Direct and the UCX InfiniBand transport switched on. For the AUSURF112 case, each of the four V100s draws only about 40 W and the run takes about 2 hours. By contrast, the same case on a single card draws 250 W and finishes within 2 minutes. After profiling the multi-card run, I found that the data synchronization traffic between the GPUs is small and sparse. If I do the same mpirun on a single card, it is still fast, because UCX then only needs cuda_copy for the MPI transfers. Am I doing something wrong, or is this simply the current state of the multi-GPU support?
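One thing I also checked is that each MPI rank really gets its own device rather than all four ranks piling onto GPU 0. A minimal wrapper sketch for that (the script name `bind_gpu.sh` and `NGPUS=4` are my assumptions for this node; `OMPI_COMM_WORLD_LOCAL_RANK` is the per-process local rank that Open MPI exports):

```shell
#!/bin/sh
# bind_gpu.sh (hypothetical name): pin each local MPI rank to one GPU
# by restricting the devices it can see before the application starts.
# Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for every launched process;
# fall back to 0 when the script is run outside mpirun.
NGPUS=4   # V100 cards per node in this setup (assumption)
export CUDA_VISIBLE_DEVICES=$(( ${OMPI_COMM_WORLD_LOCAL_RANK:-0} % NGPUS ))
# Replace this shell with the real command, e.g. pw.x and its arguments.
exec "$@"
```

Launched as `mpirun -np 4 ./bind_gpu.sh /root/yyw/q-e/sb-mpi/bin/pw.x -i ./ausurf.in`, rank 0 sees device 0, rank 1 sees device 1, and so on, which is easy to confirm in `nvidia-smi` while the job runs.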
The full log is log1.