GROMACS / GROMACS · Issues · #4629 · Closed
Issue created Oct 11, 2022 by Wei-Tse Hsu (@wehs7661)

GROMACS 2022.3 has issues with checkpointing expanded ensemble simulations

Summary

Expanded ensemble simulations performed with GROMACS 2022.3 (both the thread-MPI and MPI-enabled builds) fail with a segmentation fault within a few ns of simulation. This did not happen with previous GROMACS versions. Specifically, with GROMACS 2021.4 I was able to run several expanded ensemble simulations for hundreds of ns using exactly the same set of input GRO/TOP/MDP files without any error, whereas every trial I performed with GROMACS 2022.3 failed with a segmentation fault.

GROMACS version

GROMACS 2022.3, including the thread-MPI and MPI-enabled versions.

Steps to reproduce

I have attached a compressed folder issue (issue.zip), which contains two subfolders, inputs and outputs. The inputs subfolder contains the input TPR file and the corresponding GRO, TOP, and MDP files; the file mdout.mdp records the adopted values of all MD parameters and the command used to generate the TPR file. The outputs subfolder contains the LOG files and the SLURM outputs of two trials, one performed with the thread-MPI and one with the MPI-enabled version of GROMACS 2022.3. With these files, the error can be reproduced with the following mdrun command (also logged in the log file) using the thread-MPI version of GROMACS 2022.3:

gmx mdrun -s anthracene_EXE.tpr -x anthracene_EXE.xtc -c anthracene_output.gro -g md.log -e md.edr -cpi state.cpt -dhdl dhdl.xvg

If the MPI-enabled version of GROMACS 2022.3 is used instead, the command is the same except that gmx_mpi replaces gmx. The system is small enough (around 3,000 atoms) that a few ns of simulation should not take long.
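For concreteness, a sketch of the corresponding MPI-enabled invocation is given below; the mpirun launcher and the rank count are assumptions and should be adapted to the local environment and scheduler:

# assumed launcher and rank count; only gmx_mpi replaces gmx in the original command
mpirun -np 4 gmx_mpi mdrun -s anthracene_EXE.tpr -x anthracene_EXE.xtc -c anthracene_output.gro -g md.log -e md.edr -cpi state.cpt -dhdl dhdl.xvg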

What is the current bug behavior?

As can be seen in the attached SLURM outputs, with the thread-MPI version of GROMACS 2022.3 the simulation failed with the following message:

/var/spool/slurm/d/job12026312/slurm_script: line 12: 39069 Segmentation fault      (core dumped) gmx mdrun -s anthracene_EXE.tpr -x anthracene_EXE.xtc -c anthracene_output.gro -g md.log -e md.edr -cpi state.cpt -dhdl dhdl.xvg

With the MPI-enabled version, the simulation failed with the following message:

[r005:50899:0:50899] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1200000000)
==== backtrace (tid:  50899) ====
 0 0x0000000000055799 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000012dd0 .annobin_sigaction.c()  sigaction.c:0
 2 0x0000000000e68ef8 xdr_float()  ???:0
 3 0x0000000000e69082 xdr_vector()  ???:0
 4 0x0000000000e5078d doVectorLow<float, std::allocator<float>, StateFepEntry>()  checkpoint.cpp:0
 5 0x0000000000e5587c do_cpt_df_hist()  checkpoint.cpp:0
 6 0x0000000000e565cd write_checkpoint_data()  ???:0
 7 0x0000000000c72adb write_checkpoint()  mdoutf.cpp:0
 8 0x0000000000c72f41 mdoutf_write_checkpoint()  ???:0
 9 0x0000000000f30c16 gmx::CheckpointHelper::writeCheckpoint()  ???:0
10 0x0000000000f30cb6 gmx::CheckpointHelper::run()  ???:0
11 0x0000000000fa4fc7 gmx::ModularSimulatorAlgorithm::populateTaskQueue()  ???:0
12 0x0000000000fa5823 gmx::ModularSimulatorAlgorithm::getNextTask()  ???:0
13 0x0000000000f4faf5 gmx::ModularSimulator::run()  ???:0
14 0x0000000000e01670 gmx::Mdrunner::mdrunner()  ???:0
15 0x0000000000409297 gmx::gmx_mdrun()  ???:0
16 0x00000000004093b8 gmx::gmx_mdrun()  ???:0
17 0x0000000000670aff gmx::CommandLineModuleManager::run()  ???:0
18 0x0000000000405ebd main()  ???:0
19 0x00000000000236a3 __libc_start_main()  ???:0
20 0x0000000000405f2e _start()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 50899 on node r005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
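The backtrace points at do_cpt_df_hist and doVectorLow in checkpoint.cpp, i.e. the crash happens while the expanded ensemble free energy histogram is being written to the checkpoint file. A minimal sketch of how one might narrow this down, assuming a standard gmx installation; the -ntmpi/-ntomp values and the md_debug.log name are illustrative only:

# inspect the contents of the last successfully written checkpoint
gmx dump -cp state.cpt | less

# try to reproduce on a single thread-MPI rank with one OpenMP thread
gmx mdrun -ntmpi 1 -ntomp 1 -s anthracene_EXE.tpr -cpi state.cpt -dhdl dhdl.xvg -g md_debug.log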

What did you expect the correct behavior to be?

The simulation should have finished successfully, as it did with GROMACS 2021.4.
