Frequency of checking for inter-simulation signalling is too high for large-scale parallel REMD - Redmine #692
I observed that the same .tpr ran 5-10% slower under REMD than without. I added a new timer for the REMD routines and found that the cost of doing exchanges only accounted for a small fraction of the increase. See for details. My nstlist=5.
REMD uses the “multi-simulation” capability of GROMACS. There is a signalling mechanism that allows the master nodes of the simulations to communicate with one another. This mechanism controls whether
1) the heuristic neighbourlists are updated (but I think this feature is supposed to be disabled until further notice),
2) checkpointing will occur soon,
3) the simulation will stop soon, and
4) the timing counters will reset soon.
A checkpoint is triggered when any of the simulation master nodes observes that the run time has exceeded the time that should elapse after the last scheduled checkpoint. That node sets a signal, and once the signal has been received at the next neighbour-search step in each simulation, the checkpoint file is written. However, the simulations don’t have to communicate at every neighbour-search step, which is what happens at the moment. On smaller simulation systems, or with faster processors, or with slower networks, the cost of the inter-simulation communication needed for 2), 3) and 4) can be too large compared with the cost of doing the simulation itself. As REMD usage grows, and GROMACS scaling and computer sizes improve, this will become an increasingly significant issue.
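To make the cost argument concrete, here is a minimal Python sketch of the signalling pattern described above. It is not GROMACS code: the function `run_multisim`, its parameters and the `trigger` map are all hypothetical, and the `any(pending)` call merely stands in for whatever inter-simulation reduction the real code performs.

```python
def run_multisim(n_sims, n_steps, comm_interval, trigger):
    """Toy model of checkpoint signalling between simulations.

    trigger maps a simulation index to the step at which its master node
    decides a checkpoint is due (hypothetical stand-in for the timer check).
    Returns the steps at which all simulations checkpoint together, and the
    number of inter-simulation communication phases that were needed.
    """
    pending = [False] * n_sims   # locally raised checkpoint signals
    checkpoints = []
    comm_count = 0
    for step in range(n_steps):
        # Each master node may raise its own signal at any step.
        for sim in range(n_sims):
            if trigger.get(sim) == step:
                pending[sim] = True
        # Signals are only exchanged on communication steps.
        if step % comm_interval == 0:
            comm_count += 1
            if any(pending):     # stands in for an inter-simulation reduction
                checkpoints.append(step)
                pending = [False] * n_sims
    return checkpoints, comm_count


if __name__ == "__main__":
    # Communicating every 5 steps versus every 50 steps.
    print(run_multisim(4, 100, 5, {2: 12}))
    print(run_multisim(4, 100, 50, {2: 12}))
```

In this toy model, a signal raised at step 12 is acted on at step 15 when communicating every 5 steps (20 communication phases in 100 steps), and at step 50 when communicating every 50 steps (2 communication phases). The checkpoint is slightly delayed but still synchronous, which is the trade-off under discussion.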
The code that handles simulation stop conditions also acts on neighbour-search steps, but this seems to me a bit backwards. Non-emergency stop conditions have to be handled in a way that can be synchronized across multi-simulations so that checkpointing is synchronous. However, in the absence of 1), there’s no reason why the checks for 2), 3) and 4) need to happen as often as every nstlist steps, as currently implemented.
There should be a way of configuring such checks to be dependent on nstglobalcomm, rather than nstlist. Better still might be to introduce nstsignalcomm, to cater for the multi-simulation scenario where you have lots of simulations each doing large-scale parallel simulations with fairly frequent neighbour-searching (relative to execution time of a parallel MD step). Now the best scenario can be something like nstsignalcomm = 200, nstglobalcomm = 50 and nstlist=10 (nstsignalcomm should be a multiple of nstglobalcomm, I expect). Inter-simulation global communication happens when steps % nstsignalcomm == 0, intra-simulation global communication happens when steps % nstglobalcomm == 0, and neighbourlist updates happen as usual. When 1) is used, then probably nstsignalcomm must equal nstlist and that’s life.
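The proposed scheduling can be sketched as follows. This is only an illustration in Python under the assumptions stated in the proposal: nstsignalcomm is the suggested new parameter (it does not exist yet), the helper names `comm_schedule` and `count_events` are hypothetical, and the multiple-of-nstglobalcomm check reflects the expectation above rather than a confirmed requirement.

```python
def comm_schedule(step, nstlist, nstglobalcomm, nstsignalcomm):
    """Report which kinds of work would fall on a given MD step."""
    return {
        "neighbour_search": step % nstlist == 0,
        "intra_sim_comm": step % nstglobalcomm == 0,
        "inter_sim_signal": step % nstsignalcomm == 0,
    }


def count_events(n_steps, nstlist=10, nstglobalcomm=50, nstsignalcomm=200):
    """Count how often each kind of work occurs over n_steps steps."""
    # Assumption from the proposal: signalling steps should coincide with
    # intra-simulation communication steps.
    if nstsignalcomm % nstglobalcomm != 0:
        raise ValueError("nstsignalcomm should be a multiple of nstglobalcomm")
    totals = {"neighbour_search": 0, "intra_sim_comm": 0, "inter_sim_signal": 0}
    for step in range(n_steps):
        flags = comm_schedule(step, nstlist, nstglobalcomm, nstsignalcomm)
        for name, active in flags.items():
            if active:
                totals[name] += 1
    return totals


if __name__ == "__main__":
    print(count_events(1000))
```

With the example values over 1000 steps, this gives 100 neighbour-search steps, 20 intra-simulation communication steps and 5 inter-simulation signalling steps, compared with 100 inter-simulation communications if signalling were tied to nstlist=10.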
I’m happy to implement this nstsignalcomm feature, but I thought I should solicit comments on whether my analysis is accurate and complete, and whether there are any pitfalls of which people are aware.
(from redmine: issue id 692, created on 2011-02-01 by rolandschulz, closed on 2016-06-27)
- relates #860 (closed)
- relates #1857 (closed)
- relates #1942 (closed)
- Revision d5bd278b by Mark Abraham on 2016-06-27T17:31:02Z:
Removed unnecessary inter-simulation signalling

Generally, multi-simulation runs do not need to couple the simulations (discussion at #692). Individual algorithms implemented with multi-simulations might need to do so, but should take care of their own details, and now do. Scaling should improve in the cases where simulations are now decoupled.

It is unclear what the expected behaviour of a multi-simulation should be if the user supplies any of the possible non-uniform distributions of init_step and nsteps, sourced from any of .mdp, .cpt or command line. Instead, we report on the non-uniformity and proceed. It's always possible that the user knows what they are doing. In particular, now that multi-simulations are no longer explicitly coupled, any heterogeneity in the execution environment will lead to checkpoints and -maxh acting at different time steps, unless a user-selected algorithm requires that the simulations stay coordinated (e.g. REMD or ensemble restraints).

In the implementation of signalling, we have stopped checking gs for NULL as a proxy for whether we should be doing signalling at that communication phase. Replaced with a helper object in which explicit flags are set. Added unit tests of that functionality.

Improved documentation of check_nstglobalcomm. mdrun now reports the number of steps between intra-simulation communication to the log file. Noted minor TODOs for future cleanup.

Added some trivial test cases for termination by maxh in normal-MD, multi-sim and REMD cases. Refactored multi-sim tests to make this possible without duplication. This is complicated by the way filenames get changed by mdrun -multi by the former par_fn, so cleaned up the way that is handled so it can work and be re-used better. Introduced mdrun integration-test object library to make that build system work a little better. Made some minor improvements to Doxygen setup for integration tests.

Fixes #860, #692, #1857, #1942.
Change-Id: I5f7b98f331db801b058ae2b196d79716b5912b09