mdrun hangs upon "-nsteps " or "-maxh" trigger with more than 20 MPI processes - Redmine #2131
Archive from user: Jan-Philipp Machtens

Dear all,

In standard MD simulations with 20 or more MPI processes in total, my mdrun hangs (GROMACS 2016.2) when either “-nsteps” or “-maxh” should trigger mdrun termination.

I extensively tested on CPU-only nodes (2x E5-2680 v3 each):

(1) mdrun using thread MPI on a single node
(2) mdrun compiled with up-to-date ParaStation MPI across 1, 2, 4, or 5 nodes

Summary: mdrun did not terminate upon a -nsteps/-maxh trigger whenever the total number of MPI threads or MPI processes across all nodes was equal to or larger than 20, irrespective of the number of OpenMP threads per MPI thread/process.

I tested GROMACS 5.1.x versions, GROMACS 2016, and GROMACS 2016.1, and it appears that this issue is specific to GROMACS 2016.2.

When GROMACS hangs, the output looks like this:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
GROMACS: gmx mdrun, version 2016.2
Executable: /home/XXX/XXX/bin/gromacs-2016.2-threadMPI/bin/gmx
Data prefix: /home/XXX/XXX/bin/gromacs-2016.2-threadMPI
Working dir: /work/XXX/XXX/test-norestraint
Command line:
gmx mdrun -nsteps 300 -ntomp 2

Running on 1 node with total 24 cores, 48 logical cores
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
Hardware topology: Basic

Reading file topol.tpr, VERSION 5.1.4 (single precision)
Note: file tpx version 103, software tpx version 110

Overriding nsteps with value passed on the command line: 300 steps, 1.2 ps

Will use 16 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 24 MPI threads
Using 2 OpenMP threads per tMPI thread

starting mdrun 'Protein'
300 steps, 1.2 ps.
step 40 Turning on dynamic load balancing, because the performance loss due to load imbalance is 13.0 %.

Writing final coordinates.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Does anyone know a solution?

Many thanks in advance!!!
Jan-Philipp

Dr. Jan-Philipp Machtens
Computational Neurophysiology Group
Institute of Complex Systems - Zelluläre Biophysik (ICS-4)
Forschungszentrum Jülich, Germany

*(from redmine: issue id 2131, created on 2017-03-06 by gmxdefault, closed on 2017-03-13)*

* Relations:
  * relates #2134
  * relates #1781
  * relates #2041
* Changesets:
  * Revision 66ec44e6ffaf2a72b17a01aa16cd451bf7968217 by Szilárd Páll on 2017-03-09T15:08:04Z:

```
Fix mdrun hanging upon exit with sep PME ranks

Commit 1d2d95e introduced a check and early return to skip printing perf
stats when no valid wallcycle data was collected (due to missed reset).
However, as the validity of wallcycle data does not get checked/recorded
on separate PME ranks, mdrun deadlocks before exit in collective comm
that PME ranks never enter.

This change fixes the hang by refactoring the printing code to use a
boolean rather than an early return. This means the normal code path is
unaffected in all cases (only the simulation master can ever write
reports), and the case where it is invalid to write a report (premature
termination) works correctly because all ranks communicate the data for
the report that is never written (and efficiency is not of concern in
this case).

Fixes #2131

Change-Id: If8b0813444d0b00a1a9a4a21d30fc8655c52752a
```
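The deadlock pattern described in the commit message can be illustrated with a small toy model. This is not GROMACS code: it assumes Python threads as stand-ins for MPI ranks, uses `threading.Barrier` as a stand-in for the collective communication, and the names `finish_run` and `simulate` are hypothetical. It also assumes, for illustration only, that the validity flag is set on particle-particle (PP) ranks and never on separate PME ranks. A timeout is added so that the "hang" is detected instead of blocking forever.

```python
import threading


def finish_run(rank, is_pme, barrier, use_early_return, results):
    """Toy model of the end-of-run reporting step on one rank."""
    # Assumption of this sketch: separate PME ranks never record
    # that the timing data is valid, PP ranks always do.
    data_valid = not is_pme

    if use_early_return and not data_valid:
        # Buggy pattern: this rank leaves before the collective call.
        results[rank] = "skipped collective"
        return

    # Fixed pattern: keep a boolean and let every rank take part
    # in the collective; only the master rank may write the report.
    print_report = data_valid and rank == 0

    try:
        # Stand-in for collective communication: needs ALL ranks.
        barrier.wait(timeout=2.0)
    except threading.BrokenBarrierError:
        # In real MPI there is no timeout; the rank would wait forever.
        results[rank] = "hung"
        return

    results[rank] = "reported" if print_report else "done"


def simulate(n_pp, n_pme, use_early_return):
    """Run n_pp PP ranks followed by n_pme PME ranks; return per-rank outcome."""
    n_ranks = n_pp + n_pme
    barrier = threading.Barrier(n_ranks)
    results = [None] * n_ranks
    threads = [
        threading.Thread(
            target=finish_run,
            args=(rank, rank >= n_pp, barrier, use_early_return, results),
        )
        for rank in range(n_ranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("early return:", simulate(4, 2, use_early_return=True))
    print("boolean flag:", simulate(4, 2, use_early_return=False))
```

In the early-return variant the PME stand-ins never reach the barrier, so the PP stand-ins wait until the timeout expires. In the boolean variant every rank enters the collective and only rank 0 is marked as writing a report, which mirrors the shape of the fix described above.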