Seg Fault when running flat-bottom position restraints with MPI - Redmine #2095
Archive from user: Yunlong Liu
I compiled GROMACS (git master branch and the 2016.1 release) with the following settings:
- GCC 5.2.0 / GCC 4.9.2
- OpenMPI 2.0.1 / MPICH 3.2
- OpenMP enabled
- FFTW 3.3.5
- AVX2_256
- CUDA 7.5
- CUDA_HOST_COMPILER 4.9.2
In my position restraint topology files, I applied flat-bottom position restraints to three atoms. When I started my GROMACS job with
mpirun -np 4 gmx_mpi mdrun ...
OpenMPI reported a segmentation fault:
[gpu072:50339] *** Process received signal ***
[gpu072:50339] Signal: Segmentation fault (11)
[gpu072:50339] Signal code: Address not mapped (1)
[gpu072:50339] Failing at address: (nil)
[gpu072:50338] *** Process received signal ***
[gpu072:50338] Signal: Segmentation fault (11)
[gpu072:50338] Signal code: Address not mapped (1)
[gpu072:50338] Failing at address: (nil)
[gpu072:50339] [ 0] /lib64/libpthread.so.0(+0xf790)[0x2aaaaf001790]
[gpu072:50339] [ 1] [gpu072:50338] [ 0] /lib64/libpthread.so.0(+0xf790)[0x2aaaaf001790]
[gpu072:50338] [ 1] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x49662b)[0x2aaaab16362b]
[gpu072:50339] [ 2] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x49662b)[0x2aaaab16362b]
[gpu072:50338] [ 2] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x497fe2)[0x2aaaab164fe2]
[gpu072:50339] [ 3] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x497fe2)[0x2aaaab164fe2]
[gpu072:50338] [ 3] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z17dd_make_local_topP12gmx_domdec_tP18gmx_domdec_zones_tiPA3_fPfPiP10t_forcerecS4_P11gmx_vsite_tPK10gmx_mtop_tP14gmx_localtop_t+0x354)[0x2aaaab1654bd]
[gpu072:50339] [ 4] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z17dd_make_local_topP12gmx_domdec_tP18gmx_domdec_zones_tiPA3_fPfPiP10t_forcerecS4_P11gmx_vsite_tPK10gmx_mtop_tP14gmx_localtop_t+0x354)[0x2aaaab1654bd]
[gpu072:50338] [ 4] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z19dd_partition_systemP8_IO_FILElP9t_commreciiP7t_statePK10gmx_mtop_tPK10t_inputrecS4_PSt6vectorIN3gmx11BasicVectorIfEESaISE_EEP9t_mdatomsP14gmx_localtop_tP10t_forcerecP11gmx_vsite_tP10gmx_constrP6t_nrnbP13gmx_wallcyclei+0x1464)[0x2aaaab15c890]
[gpu072:50339] [ 5] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z19dd_partition_systemP8_IO_FILElP9t_commreciiP7t_statePK10gmx_mtop_tPK10t_inputrecS4_PSt6vectorIN3gmx11BasicVectorIfEESaISE_EEP9t_mdatomsP14gmx_localtop_tP10t_forcerecP11gmx_vsite_tP10gmx_constrP6t_nrnbP13gmx_wallcyclei+0x1464)[0x2aaaab15c890]
[gpu072:50338] [ 5] gmx_mpi[0x429f6e]
[gpu072:50339] [ 6] gmx_mpi[0x423b91]
[gpu072:50339] [ 7] gmx_mpi[0x429f6e]
[gpu072:50338] [ 6] gmx_mpi[0x423b91]
[gpu072:50338] [ 7] gmx_mpi[0x428150]
[gpu072:50339] [ 8] gmx_mpi[0x428150]
[gpu072:50338] [ 8] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x452977)[0x2aaaab11f977]
[gpu072:50339] [ 9] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x452977)[0x2aaaab11f977]
[gpu072:50338] [ 9] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x38d)[0x2aaaab12142d]
[gpu072:50339] [10] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x38d)[0x2aaaab12142d]
[gpu072:50338] [10] gmx_mpi[0x41941c]
[gpu072:50338] [11] gmx_mpi[0x41941c]
[gpu072:50339] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aaaaf22dd5d]
[gpu072:50338] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aaaaf22dd5d]
[gpu072:50339] [12] gmx_mpi[0x419299]
[gpu072:50338] *** End of error message ***
gmx_mpi[0x419299]
[gpu072:50339] *** End of error message ***
The stack trace shows that the segfault occurs in the dd_make_local_top() function (declared in domdec.h).
However, when I removed mpirun, i.e. ran the same tpr with a single process and multiple OpenMP threads, I did not get any segfault.
I attached the tpr file that can trigger this seg fault.
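For reference, flat-bottomed position restraints are specified in the topology with function type 2 in a [ position_restraints ] section. The atom indices and parameter values below are illustrative, not the reporter's actual input:

```
[ position_restraints ]
; ai  funct  g    r      k
;            geometry  radius(nm)  force const (kJ mol^-1 nm^-2)
   1    2    1    0.5    1000
   2    2    1    0.5    1000
   3    2    1    0.5    1000
```

Here g selects the flat-bottom geometry (e.g. 1 for a sphere), r is the radius of the flat region, and k is the force constant applied outside it; ordinary (harmonic) position restraints use funct 1 instead and do not trigger this bug.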
(from redmine: issue id 2095, created on 2016-12-29 by gmxdefault, closed on 2017-01-20)
- Relations:
- duplicates #2236 (closed)
- Changesets:
- Revision 9a45db56 by Berk Hess on 2017-01-05T16:56:06Z:
Fix flat-bottom position restraints + DD + OpenMP
When using flat-bottom position restraints with DD and OpenMP
a (re)allocation was missing, causing a segv.
Fixes #2095.
Change-Id: I03af546a0b8d03a3d384d86a2582a67584e72d46