Checkpoint not created upon reaching time given in maxh - Redmine #860
Archive from user: Ben Reynwar

I’m having a problem with gromacs not terminating as expected when using the maxh option. It occurs when doing a REMD simulation with infinite cutoffs. It does not occur for the first run, but only for the second run that was started from the first run checkpoints. I have been using 2 processors for each replica. The version used is 4.5.5 with a bug fix applied from http://lists.gromacs.org/pipermail/gmx-developers/2011-October/005405.html

I’m specifying -maxh 24 and as expected see the following in the stderr output.

    Step 773882: Run time exceeded 23.760 hours, will terminate the run
    Step 773876: Run time exceeded 23.760 hours, will terminate the run
    Step 773880: Run time exceeded 23.760 hours, will terminate the run
    etc

However I can see that the output files continued to be written for another hour until at 25 hours the simulation was terminated by the queueing system. No checkpoint files were produced. The output files show that the simulation continued until about step 797000.

I attach the cpt and tpr files for starting a 2 replica simulation that exhibits this problem.

*(from redmine: issue id 860, created on 2012-01-10 by gmxdefault, closed on 2016-06-27)*

* Relations:
  * relates #692
  * relates #1500
* Changesets:
  * Revision d5bd278b11bccc0e3d1a5bd4ee417c570fea133c by Mark Abraham on 2016-06-27T17:31:02Z:

```
Removed unnecessary inter-simulation signalling

Generally, multi-simulation runs do not need to couple the simulations
(discussion at #692). Individual algorithms implemented with
multi-simulations might need to do so, but should take care of their
own details, and now do. Scaling should improve in the cases where
simulations are now decoupled.

It is unclear what the expected behaviour of a multi-simulation should
be if the user supplies any of the possible non-uniform distributions
of init_step and nsteps, sourced from any of .mdp, .cpt or command
line. Instead, we report on the non-uniformity and proceed. It's
always possible that the user knows what they are doing. In
particular, now that multi-simulations are no longer explicitly
coupled, any heterogeneity in the execution environment will lead to
checkpoints and -maxh acting at different time steps, unless a
user-selected algorithm requires that the simulations stay coordinated
(e.g. REMD or ensemble restraints).

In the implementation of signalling, we have stopped checking gs for
NULL as a proxy for whether we should be doing signalling at that
communication phase. Replaced with a helper object in which explicit
flags are set. Added unit tests of that functionality.

Improved documentation of check_nstglobalcomm.

mdrun now reports the number of steps between intra-simulation
communication to the log file.

Noted minor TODOs for future cleanup.

Added some trivial test cases for termination by maxh in normal-MD,
multi-sim and REMD cases. Refactored multi-sim tests to make this
possible without duplication. This is complicated by the way filenames
get changed by mdrun -multi by the former par_fn, so cleaned up the
way that is handled so it can work and be re-used better. Introduced
mdrun integration-test object library to make that build system work a
little better. Made some minor improvements to Doxygen setup for
integration tests.

Fixes #860, #692, #1857, #1942.

Change-Id: I5f7b98f331db801b058ae2b196d79716b5912b09
```

* Uploads:
  * [run0_0.cpt](/uploads/08a5a97e37d45ea5a636823a1e2b6b2f/run0_0.cpt)
  * [run0_0.tpr](/uploads/b9a5564a491ab63849f38da84db54999/run0_0.tpr)
  * [run0_1.cpt](/uploads/29afa21fb00a34b5a6af77c3b64b5edf/run0_1.cpt)
  * [run0_1.tpr](/uploads/6672508224b34f384acab9378c1bcbb9/run0_1.tpr)
  * [tahsp_amber03_chirre.itp](/uploads/136a4497474705d1220d6b06e9616289/tahsp_amber03_chirre.itp)
  * [run0_0.mdp](/uploads/a3219714f940ee0b38c9abfc8f7bc6c6/run0_0.mdp)
  * [run0_1.mdp](/uploads/30e73c9f241f6e57e0c9820be4517d10/run0_1.mdp)
  * [tahsp_starting.pdb](/uploads/5be62407cbd313470722de7297f0d72a/tahsp_starting.pdb)
  * [tahsp_amber03.top](/uploads/afaf58a2511bdd45f7e03b5ecd7514fe/tahsp_amber03.top)
  * [tahsp_amber03_alphacryst_posre.itp](/uploads/3971ccef539fb22aade056d40fafbeb8/tahsp_amber03_alphacryst_posre.itp)
  * [run0_0_p.top](/uploads/19a1d7b35e4289f269467b13ea5e66bf/run0_0_p.top)
  * [run0_1_p.top](/uploads/5cd3ac46b374b002085b2717323cbebe/run0_1_p.top)
  * [tahsp_amber03_Protein.itp](/uploads/fe33aa0d612a9f9b29e7ca7603a3298e/tahsp_amber03_Protein.itp)
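
For anyone checking the numbers in the report: the 23.760-hour figure in the quoted stderr output matches mdrun's documented behaviour of terminating after 0.99 times the `-maxh` value. The following is a minimal Python sketch of that arithmetic, together with the spread between the quoted steps and the approximate overrun before the queueing system killed the job. The function and variable names are hypothetical and only for illustration; the step numbers are copied from the report, and "about step 797000" is the reporter's approximation.

```python
def maxh_threshold_hours(maxh: float, fraction: float = 0.99) -> float:
    """Wall time in hours after which mdrun announces termination.

    The 0.99 factor follows the mdrun help text for -maxh; the function
    itself is an illustrative helper, not part of GROMACS.
    """
    return fraction * maxh


# Steps at which the "Run time exceeded" message was printed (from the report).
reported_steps = [773882, 773876, 773880]

# Approximate last step seen in the output files (from the report).
last_written_step = 797000

threshold = maxh_threshold_hours(24)
spread = max(reported_steps) - min(reported_steps)
overrun = last_written_step - max(reported_steps)

print(f"termination threshold: {threshold:.3f} hours")
print(f"spread between reported steps: {spread}")
print(f"steps run after the last report: about {overrun}")
```

The nonzero spread shows that the time limit was noticed at slightly different steps across the run, which is the situation the changeset above discusses for coupled algorithms such as REMD.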