mdrun hangs upon "-nsteps " or "-maxh" trigger with more than 20 MPI processes - Redmine #2131
Archive from user: Jan-Philipp Machtens
Dear all,
In standard MD simulations with 20 or more MPI processes in total, my
mdrun (GROMACS 2016.2) hangs whenever either "-nsteps" or "-maxh" should
trigger mdrun termination.
I extensively tested on CPU-only nodes (2x E5-2680 v3 each)
(1) mdrun using thread MPI on a single node
(2) mdrun compiled against an up-to-date ParaStation MPI, across 1, 2, 4, or 5 nodes
Summary:
Mdrun did not terminate upon a -nsteps/-maxh trigger whenever the total
number of MPI threads or MPI processes across all nodes was equal to or
larger than 20, irrespective of the number of OpenMP threads per MPI process.
I also tested GROMACS 5.1.x versions, GROMACS 2016, and GROMACS 2016.1, and
it appears that this issue is specific to GROMACS 2016.2.
When GROMACS hangs, the output looks like this:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
GROMACS: gmx mdrun, version 2016.2
Executable: /home/XXX/XXX/bin/gromacs-2016.2-threadMPI/bin/gmx
Data prefix: /home/XXX/XXX/bin/gromacs-2016.2-threadMPI
Working dir: /work/XXX/XXX/test-norestraint
Command line:
gmx mdrun -nsteps 300 -ntomp 2
Running on 1 node with total 24 cores, 48 logical cores
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
Hardware topology: Basic
Reading file topol.tpr, VERSION 5.1.4 (single precision)
Note: file tpx version 103, software tpx version 110
Overriding nsteps with value passed on the command line: 300 steps, 1.2 ps
Will use 16 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 24 MPI threads
Using 2 OpenMP threads per tMPI thread
starting mdrun 'Protein'
300 steps, 1.2 ps.
step 40 Turning on dynamic load balancing, because the performance loss due to load imbalance is 13.0 %.
Writing final coordinates.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Does anyone know a solution?
Many thanks in advance!!!
Jan-Philipp
Dr. Jan-Philipp Machtens Computational Neurophysiology Group Institute of Complex Systems - Zelluläre Biophysik (ICS-4) Forschungszentrum Jülich, Germany
(from redmine: issue id 2131, created on 2017-03-06 by gmxdefault, closed on 2017-03-13)
- Relations:
- relates #2134 (closed)
- relates #1781 (closed)
- relates #2041 (closed)
- Changesets:
- Revision 66ec44e6 by Szilárd Páll on 2017-03-09T15:08:04Z:
Fix mdrun hanging upon exit with sep PME ranks
Commit 1d2d95e introduced a check and early return to skip printing perf
stats when no valid wallcycle data was collected (due to a missed reset).
However, because the validity of wallcycle data is neither checked nor
recorded on separate PME ranks, mdrun deadlocks before exit in a
collective communication that the PME ranks never enter.
This change fixes the hang by refactoring the printing code to use a
boolean rather than an early return. This means the normal code path
is unaffected in all cases (only the simulation master can ever write
reports), and the case where it is invalid to write a report
(premature termination) works correctly because all ranks communicate
the data for the report that is never written (and efficiency is not
of concern in this case).
Fixes #2131
Change-Id: If8b0813444d0b00a1a9a4a21d30fc8655c52752a