Multi-simulation run with more than one rank per simulation exits with MPI_ABORT - Redmine #3296
$ mpirun --mca mpi_abort_print_stack 1 -np 4 $gmx_mpi mdrun -v -multidir repl_{001..002} -nsteps 0 -s topol
GROMACS: gmx mdrun, version 2020.1-dev-20200110-4879d39
Executable: /nethome/pszilard-projects/gromacs/gromacs-20/build_AVX2_256_gcc8_cuda10.1_ompi400/bin/gmx_mpi
Data prefix: /nethome/pszilard/projects/gromacs/gromacs-20 (source tree)
Working dir: /nethome/pszilard-projects/gromacs/bench/LUMI-bench/aqp_ensemble/test_repl-128
Command line:
gmx_mpi mdrun -v -multidir repl_001 repl_002 -nsteps 0 -s topol
Back Off! I just backed up md.log to ./#md.log.11#
Back Off! I just backed up md.log to ./#md.log.11#
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see
log).
Reading file topol.tpr, VERSION 2020.1-dev-20200108-e05cc33 (single precision)
Reading file topol.tpr, VERSION 2020.1-dev-20200108-e05cc33 (single precision)
Overriding nsteps with value passed on the command line: 0 steps, 0 ps
Overriding nsteps with value passed on the command line: 0 steps, 0 ps
Changing nstlist from 40 to 100, rlist from 1.2 to 1.287
Changing nstlist from 40 to 100, rlist from 1.2 to 1.287
[dev-purley01:30476] *** An error occurred in MPI_Allreduce
[dev-purley01:30476] *** reported by process [566165505,3]
[dev-purley01:30476] *** on communicator MPI_COMM_WORLD
[dev-purley01:30476] *** MPI_ERR_COMM: invalid communicator
[dev-purley01:30476] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dev-purley01:30476] *** and potentially your MPI job)
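For orientation, the rank layout implied by the command above (4 MPI ranks, 2 `-multidir` directories) can be sketched as follows. This is only an illustration and not the GROMACS implementation: the contiguous block assignment of world ranks to simulations is an assumption, and the function name `simulation_of_rank` is hypothetical.

```python
def simulation_of_rank(world_rank, world_size, num_sims):
    """Return (simulation index, rank within that simulation).

    Assumes world ranks are handed out to simulations in contiguous,
    equally sized blocks. This layout is an assumption made for
    illustration, not something confirmed by the log above.
    """
    if world_size % num_sims != 0:
        raise ValueError("number of ranks must be a multiple of the number of simulations")
    ranks_per_sim = world_size // num_sims
    return world_rank // ranks_per_sim, world_rank % ranks_per_sim


# mpirun -np 4 with two directories (repl_001, repl_002): 2 ranks per simulation
layout = [simulation_of_rank(rank, 4, 2) for rank in range(4)]
for world_rank, (sim, local_rank) in enumerate(layout):
    print(f"world rank {world_rank}: simulation {sim}, local rank {local_rank}")
```

Under this assumed layout, the process that reported the error (rank 3 in `[566165505,3]`) would be the second rank of the second simulation, which fits the failure only showing up with more than one rank per simulation.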
(from redmine: issue id 3296, created on 2020-01-13 by pszilard, closed on 2020-02-17)
- Changesets:
  - Revision 4d60bf59 by Berk Hess on 2020-01-16T19:44:26Z:
    Correct fixed redmine issue id from 3297 to 3296
    Recent commit 36a65816 said it fixed issue 3297, but this should
    have been issue 3296.
    Refs #3296
    Change-Id: I85a95e818d3cda816211dc2aa8ddb32e9e0c69d4