Error with GPU direct comms enabled
Summary
I have been benchmarking GROMACS on JUWELS Booster. The problem occurs with one specific benchmark system:
benchRIB.tpr from https://www.mpinat.mpg.de/grubmueller/bench
When I use 16 GPUs with GMX_ENABLE_DIRECT_GPU_COMM=true, the simulation fails with the error:
Fatal error: Step 600: The total potential energy is nan, which is not finite.
If I run without setting the GMX_ENABLE_DIRECT_GPU_COMM environment variable, the simulation completes successfully.
Exact steps to reproduce
I have attached the Slurm script I used on JUWELS Booster and the resulting log file (gmx_error.slurm and md_error.log). I have also attached the submission script and log file for the successful case, where GMX_ENABLE_DIRECT_GPU_COMM is not set (gmx_success.slurm and md_success.log).
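
For quick reference, the difference between the two cases comes down to whether the variable is exported before mdrun is launched. The snippet below is only a minimal sketch of that toggle, not a copy of the attached scripts, which remain the authoritative reproduction. The helper name set_direct_comm is hypothetical, and the launcher (srun), binary name (gmx_mpi), and the commented-out launch line are assumptions about a typical setup.

```shell
#!/bin/bash
# Sketch only: see gmx_error.slurm and gmx_success.slurm for the real job scripts.

# Hypothetical helper: "on" exports the variable, anything else unsets it.
set_direct_comm() {
    if [ "$1" = "on" ]; then
        export GMX_ENABLE_DIRECT_GPU_COMM=true
    else
        unset GMX_ENABLE_DIRECT_GPU_COMM
    fi
    # Report the state that the subsequent mdrun launch would inherit.
    echo "GMX_ENABLE_DIRECT_GPU_COMM=${GMX_ENABLE_DIRECT_GPU_COMM:-unset}"
}

# Failing case in this report: variable set before launching on 16 GPUs.
set_direct_comm on
# Assumed launch line, left commented out because the exact options are in the attachments:
# srun gmx_mpi mdrun -s benchRIB.tpr -g md_error.log

# Successful case in this report: variable left unset.
set_direct_comm off
# srun gmx_mpi mdrun -s benchRIB.tpr -g md_success.log
```

Everything else (node count, GPU binding, mdrun options) should be taken from the attached scripts rather than from this sketch.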