Thread-MPI error in GROMACS-2018 - Redmine #2540
Archive from user: Siva Dasetty
Hello,
I have come across an error that causes GROMACS (2018/2018.1) to crash. The message is:
“tMPI error: Receive buffer size too small for transmission (in valid comm)
Aborted”
The error seems to occur only immediately following a LINCS or SETTLE warning, and it is reproducible across different systems. A simple example is an energy minimization of a box of 1000 rigid TIP4P/Ice water molecules generated with gmx solvate. When SETTLE is used as the constraint algorithm, several SETTLE warnings appear in the early steps of the minimization, and GROMACS crashes with the above error message. If I replace SETTLE with LINCS, GROMACS crashes with the same error message following a LINCS warning. Other systems that have produced this error are -OH terminated self-assembled monolayer surfaces (h-bonds constrained by LINCS) and mica surfaces (h-bonds constrained by LINCS). Naturally, reducing -ntmpi to 1 eliminates the error in all cases.
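For reference, the constraint setup described above can be expressed in the .mdp file roughly as follows. This is only a sketch of the standard GROMACS options (the attached em.mdp is authoritative); note that SETTLE for rigid water is normally selected by the [ settles ] directive in the water model's topology, not by constraint-algorithm:

```
; Constrain bonds involving hydrogens with LINCS (the default algorithm)
constraints          = h-bonds
constraint-algorithm = lincs
; Rigid water such as TIP4P/Ice is constrained with SETTLE via the
; [ settles ] directive in the water topology; switching water to LINCS
; requires editing the topology, not just these options.
```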
The problem does appear to be hardware-dependent. Specifically, the tested cluster nodes contain K20/K40 GPUs and Intel Xeon E5-2680v3 processors (20/24 cores). I used GCC/5.4.0 and CUDA/8.0.44 for installing GROMACS. An installation on my desktop machine with very similar options does not produce the thread-MPI error.
Example of procedure that causes error:
- Node contains 24 cores and 2 K40 GPUs
gmx solvate -cs tip4p -o box.gro -box 3.2 3.2 3.2 -maxsol 1000
gmx grompp -f em.mdp -c box.gro -p tip4pice.top -o em
export OMP_NUM_THREADS=6
gmx mdrun -v -deffnm em -ntmpi 4 -ntomp 6 -pin on
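For completeness, a minimal steepest-descent em.mdp along the lines described above might look like the fragment below. This is hypothetical (the actual em.mdp is attached to the issue) and only illustrates typical energy-minimization settings:

```
; Minimal energy-minimization parameters (illustrative only)
integrator    = steep      ; steepest-descent minimization
emtol         = 100.0      ; stop when max force < 100 kJ/mol/nm
nsteps        = 5000       ; maximum number of minimization steps
cutoff-scheme = Verlet
coulombtype   = PME
constraints   = none       ; SETTLE comes from [ settles ] in the water topology
```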
Attached are the relevant topology (tip4pice.top), mdp (em.mdp), tpr (em.tpr), and log (em.log) files. In addition, the tip4gro and box.gro files are included.
Thanks in advance for any ideas as to what might be causing this problem,
Siva Dasetty
(from redmine: issue id 2540, created on 2018-06-01 by gmxdefault, closed on 2018-06-12)
- Changesets:
- Revision dce23f77 by Berk Hess on 2018-06-06T22:33:48Z:
Fix MPI inconsistency in EM after constraint failure
Fixes issue #2540
Change-Id: Id18c17af82f80917388c11fc776b79bf4966a4ac