GROMACS 2022.3 has issues with checkpointing expanded ensemble simulations
Summary
Expanded ensemble simulations performed with GROMACS 2022.3 (both the thread-MPI and the MPI-enabled builds) seem to fail readily with a segmentation fault within a few ns of simulation. This did not happen with previous GROMACS versions: using GROMACS 2021.4 and exactly the same set of GRO/TOP/MDP input files, I was able to run several expanded ensemble simulations for hundreds of ns without any errors, whereas all of the trials I performed with GROMACS 2022.3 failed with a segmentation fault.
GROMACS version
GROMACS 2022.3, including the thread-MPI and MPI-enabled versions.
Steps to reproduce
Here I have attached a compressed folder issue (issue.zip), which contains two subfolders, inputs and outputs. The subfolder inputs contains the input TPR file and the corresponding GRO, TOP, and MDP files. The file mdout.mdp shows the adopted values of all MD parameters and the command I used to generate the TPR file (a rough sketch is also given after the mdrun command below). The subfolder outputs contains the LOG files and the SLURM outputs of two trials performed with the thread-MPI and the MPI-enabled versions of GROMACS 2022.3, respectively. With these files, one can reproduce the error using the following mdrun command (which is also logged in the log file) with the thread-MPI version of GROMACS 2022.3:
gmx mdrun -s anthracene_EXE.tpr -x anthracene_EXE.xtc -c anthracene_output.gro -g md.log -e md.edr -cpi state.cpt -dhdl dhdl.xvg
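For reference, the exact command used to generate the TPR file is recorded in mdout.mdp; as a rough sketch only, with the angle-bracketed names as placeholders rather than the actual file names, the generation step would look something like:
gmx grompp -f <mdp> -c <gro> -p <top> -o anthracene_EXE.tpr -po mdout.mdp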
If the MPI-enabled version of GROMACS 2022.3 is used, the mdrun command is the same except that gmx_mpi is used instead of gmx. The system is small enough (around 3000 atoms) that a few ns of simulation should not take too long.
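For completeness, launching the same reproduction with the MPI-enabled build would look something like the following, where <N> is a placeholder for the number of MPI ranks rather than the value I actually used:
mpirun -np <N> gmx_mpi mdrun -s anthracene_EXE.tpr -x anthracene_EXE.xtc -c anthracene_output.gro -g md.log -e md.edr -cpi state.cpt -dhdl dhdl.xvg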
What is the current bug behavior?
As can be seen in the attached SLURM outputs, for the thread-MPI version of GROMACS 2022.3 the simulation failed with the following message:
/var/spool/slurm/d/job12026312/slurm_script: line 12: 39069 Segmentation fault (core dumped) gmx mdrun -s anthracene_EXE.tpr -x anthracene_EXE.xtc -c anthracene_output.gro -g md.log -e md.edr -cpi state.cpt -dhdl dhdl.xvg
In the case where the MPI-enabled version was used, the simulation failed with the following message; the backtrace shows that the segmentation fault occurs inside write_checkpoint (in do_cpt_df_hist), i.e. while the expanded-ensemble free-energy history is being written to the checkpoint file:
[r005:50899:0:50899] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1200000000)
==== backtrace (tid: 50899) ====
0 0x0000000000055799 ucs_debug_print_backtrace() ???:0
1 0x0000000000012dd0 .annobin_sigaction.c() sigaction.c:0
2 0x0000000000e68ef8 xdr_float() ???:0
3 0x0000000000e69082 xdr_vector() ???:0
4 0x0000000000e5078d doVectorLow<float, std::allocator<float>, StateFepEntry>() checkpoint.cpp:0
5 0x0000000000e5587c do_cpt_df_hist() checkpoint.cpp:0
6 0x0000000000e565cd write_checkpoint_data() ???:0
7 0x0000000000c72adb write_checkpoint() mdoutf.cpp:0
8 0x0000000000c72f41 mdoutf_write_checkpoint() ???:0
9 0x0000000000f30c16 gmx::CheckpointHelper::writeCheckpoint() ???:0
10 0x0000000000f30cb6 gmx::CheckpointHelper::run() ???:0
11 0x0000000000fa4fc7 gmx::ModularSimulatorAlgorithm::populateTaskQueue() ???:0
12 0x0000000000fa5823 gmx::ModularSimulatorAlgorithm::getNextTask() ???:0
13 0x0000000000f4faf5 gmx::ModularSimulator::run() ???:0
14 0x0000000000e01670 gmx::Mdrunner::mdrunner() ???:0
15 0x0000000000409297 gmx::gmx_mdrun() ???:0
16 0x00000000004093b8 gmx::gmx_mdrun() ???:0
17 0x0000000000670aff gmx::CommandLineModuleManager::run() ???:0
18 0x0000000000405ebd main() ???:0
19 0x00000000000236a3 __libc_start_main() ???:0
20 0x0000000000405f2e _start() ???:0
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 50899 on node r005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
What did you expect the correct behavior to be?
The simulation should have finished successfully, as it did when GROMACS 2021.4 was used.