Incorrect forces from the pull code with more than 32 DD ranks
I have found a problem with the pull code (specifically dihedral pulling) in the current master branch of GROMACS. So far I have only seen it when mdrun runs without a separate PME rank, i.e. gmx_mpi mdrun -npme 0
At a random timestep the total energy becomes "nan", and sim_util.cpp catches this with the following message:
Source file: src/gromacs/mdlib/sim_util.cpp (line 537) MPI rank: 0 (out of 80)
Fatal error: Step 29: The total potential energy is nan, which is not finite. The LJ and electrostatic contributions to the energy are 5381.59 and -42407.4, respectively. A non-finite potential energy can be caused by overlapping interactions in bonded interactions or very large or Nan coordinate values. Usually this is caused by a badly- or non-equilibrated initial configuration, incorrect interactions or parameters in the topology.
I investigated which energy component causes this and found it to be the pull energy (component 74). The coordinates of the system are fine, and restarting from the last geometry also works, but the run then fails again some time later at a different geometry. Because of the randomness and the dependence on the PME configuration, I think this could be a deeper problem, possibly memory leaks or corruption. For example, running the same file on a different node causes a crash at a different timestep.
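To illustrate the kind of check used to narrow this down, here is a minimal sketch in Python that scans a flat list of energy terms and reports every non-finite entry. It is not GROMACS code: the function name `find_non_finite_terms`, the array length, and the positions of the LJ and electrostatic values are assumptions made for illustration. Only the index 74 for the pull term and the two finite values are taken from the report above.

```python
import math


def find_non_finite_terms(energy_terms):
    """Return (index, value) pairs for every term that is NaN or infinite."""
    return [(i, v) for i, v in enumerate(energy_terms) if not math.isfinite(v)]


# Hypothetical energy array; the layout is an assumption, not the real one.
terms = [0.0] * 80
terms[5] = 5381.59        # LJ contribution from the error message
terms[6] = -42407.4       # electrostatic contribution from the error message
terms[74] = float("nan")  # pull energy, the component found to be non-finite

bad_indices = [i for i, _ in find_non_finite_terms(terms)]
print(bad_indices)
```

Checking each term separately, rather than only the sum, is what distinguishes a single bad component from bad coordinates, which would normally make several terms non-finite at once.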
I have attached a tpr file that shows the problem. It was built with GROMACS 2020.2. The system is quite big, but it usually crashes within 100-200 steps.
This affects both the single- and double-precision versions. It also affects both the GROMACS 2020.2 release and the current master branch, with different build setups.