maxh option and checkpoint writing do not work with REMD simulations - Redmine #1942

Archive from user: Maud Jusot

I cannot restart a REMD simulation correctly. I posted the details on the gmx-users mailing list (https://www.mail-archive.com/gromacs.org_gmx-users@maillist.sys.kth.se/msg18550.html), but to summarize:
the first simulation runs without problems and finishes correctly after the maxh time, but after a restart no checkpoint files are written anymore and gromacs does not stop at the maxh time (even though the output says it does).

I tried it with three different versions of gromacs (4.6.5, 5.1.0 and 5.1.2) on two different clusters, so I am fairly sure the problem comes neither from the installation nor from the version.

(from redmine: issue id 1942, created on 2016-04-06 by gmxdefault, closed on 2016-06-27)

Relations:
- relates #692 (closed)
Changesets:
- Revision bc98987b by Mark Abraham on 2016-06-23T12:53:55Z:

Prevent use of mdrun -maxh -multi

A proper fix can probably be made in release-2016, and if so, the
content of this commit should not be merged forward.

Refs #1942

Change-Id: Ie7e6c0ca25fba09ad1794cacbe116b03e95ff0f9

Revision d5bd278b by Mark Abraham on 2016-06-27T17:31:02Z:

Removed unnecessary inter-simulation signalling

Generally, multi-simulation runs do not need to couple the simulations
(discussion at #692). Individual algorithms implemented with
multi-simulations might need to do so, but should take care of their
own details, and now do. Scaling should improve in the cases where
simulations are now decoupled.

It is unclear what the expected behaviour of a multi-simulation should
be if the user supplies any of the possible non-uniform distributions
of init_step and nsteps, sourced from any of .mdp, .cpt or command
line. Instead, we report on the non-uniformity and proceed. It's
always possible that the user knows what they are doing. In
particular, now that multi-simulations are no longer explicitly
coupled, any heterogeneity in the execution environment will lead to
checkpoints and -maxh acting at different time steps, unless a
user-selected algorithm requires that the simulations stay coordinated
(e.g. REMD or ensemble restraints).

In the implementation of signalling, we have stopped checking gs for
NULL as a proxy for whether we should be doing signalling at that
communication phase. Replaced with a helper object in which explicit
flags are set. Added unit tests of that functionality.

Improved documentation of check_nstglobalcomm. mdrun now reports the
number of steps between intra-simulation communication to the
log file.

Noted minor TODOs for future cleanup.

Added some trivial test cases for termination by maxh in normal-MD,
multi-sim and REMD cases. Refactored multi-sim tests to make this
possible without duplication. This is complicated by the way filenames
get changed by mdrun -multi by the former par_fn, so cleaned up the
way that is handled so it can work and be re-used better. Introduced
mdrun integration-test object library to make that build system work a
little better. Made some minor improvements to Doxygen setup for
integration tests.

Fixes #860, #692, #1857, #1942.

Change-Id: I5f7b98f331db801b058ae2b196d79716b5912b09
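
Editorial illustration of the signalling change described in revision d5bd278b (not part of the commit): the message says that a NULL check on gs was used as a proxy for "do signalling in this communication phase", and that it was replaced by a helper object carrying explicit flags. The following is only a minimal sketch of that idea. Every name in it (`SignallingPlan`, `makePlan`, `shouldSignalOldStyle`) is hypothetical and is not taken from the GROMACS source, and the choice to always enable intra-simulation signalling is an assumption made for the sketch.

```cpp
#include <cassert>

// Hypothetical helper object: all names here are illustrative assumptions,
// not the actual GROMACS classes or functions.
class SignallingPlan
{
public:
    SignallingPlan(bool doIntraSim, bool doInterSim)
        : doIntraSim_(doIntraSim), doInterSim_(doInterSim)
    {
    }
    // Signal between the ranks of one simulation in this phase?
    bool doIntraSim() const { return doIntraSim_; }
    // Signal between the simulations of a multi-simulation in this phase?
    bool doInterSim() const { return doInterSim_; }

private:
    bool doIntraSim_;
    bool doInterSim_;
};

// Old style (sketch): a null pointer implicitly meant "no signalling".
inline bool shouldSignalOldStyle(const int* gs)
{
    return gs != nullptr;
}

// New style (sketch): the caller states explicitly what is needed.
// Simulations are only coupled when a user-selected algorithm
// (e.g. REMD or ensemble restraints) requires it.
inline SignallingPlan makePlan(bool isMultiSim, bool algorithmCouplesSims)
{
    // An algorithm can only couple simulations if there is more than one.
    assert(isMultiSim || !algorithmCouplesSims);
    // Assumption for this sketch: intra-simulation signalling is always on.
    return SignallingPlan(true, isMultiSim && algorithmCouplesSims);
}
```

With flags of this kind, a plain multi-simulation would get `makePlan(true, false)` and run uncoupled, while a REMD run would get `makePlan(true, true)` and keep checkpointing and -maxh termination coordinated across replicas.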

Uploads:
- tpr_files.tar.gz all the tpr files
- output_files.tar.gz output files and script