Exact checkpoint restarts are not exact in 2018 beta1 - Redmine #2318
Archive from user: Erik Lindahl
Commit 76920a4f breaks binary exact restarts, which in turn prevents
debugging of other patches.
I presume this is because the dynamic pruning keeps a state across the
checkpointing that is not saved. To reproduce, make two copies of the
attached tpr, use the -reprod flag, run one of them to completion (2000
steps), and interrupt the other one and continue it from its checkpoint.
Prior to 76920a4f, gmx check confirms the contents are identical, but
not after.
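
The reproduction steps above can be sketched as a small Python driver. This is only a sketch under assumptions: the names `full` and `split` for the two tpr copies are hypothetical, the interruption itself is done by hand (for example with Ctrl-C), and comparing the energy files is one possible choice of what to hand to gmx check, since the report does not say which outputs were compared.

```python
import subprocess


def mdrun_cmd(deffnm, resume=False):
    """Build an mdrun command line for <deffnm>.tpr with -reprod.

    With resume=True the run continues from <deffnm>.cpt, the checkpoint
    name mdrun uses when -deffnm is given.
    """
    cmd = ["gmx", "mdrun", "-deffnm", deffnm, "-reprod"]
    if resume:
        cmd += ["-cpi", deffnm + ".cpt"]
    return cmd


def check_cmd(first, second):
    """Build a gmx check command comparing two energy files.

    Comparing .edr files is an assumption; trajectories could be
    compared with -f/-f2 instead.
    """
    return ["gmx", "check", "-e", first + ".edr", "-e2", second + ".edr"]


def reproduce():
    """Run the uninterrupted reference, resume the interrupted copy, compare.

    Assumes split.tpr was already started with mdrun_cmd("split") and
    interrupted by hand, so that split.cpt exists.
    """
    subprocess.run(mdrun_cmd("full"), check=True)
    subprocess.run(mdrun_cmd("split", resume=True), check=True)
    subprocess.run(check_cmd("full", "split"), check=True)
```

Calling `reproduce()` from a directory containing `full.tpr`, `split.tpr` and `split.cpt` runs the three commands in order; any difference reported by gmx check indicates that the restart was not binary exact.
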
There are three obvious options:
1) Reset the pruning so no state is kept at checkpoint steps
2) Save the state to the checkpoint
3) Disable dynamic pruning on CPUs
Hopefully (1) should not be too difficult, since I assume (2) is
expensive.
However… it is quite embarrassing that we have now broken exact restarts
in every single beta of the last ~3 versions. This really shows the
importance of testing checkpoint restarts for binary identity **ANY**
time we do anything related to either state or checkpoint I/O.
*(from redmine: issue id 2318, created on 2017-12-01 by gmxdefault, closed on 2017-12-05)*
* Changesets:
* Revision 50a7265fdeb486151fb4401049bb4e4f86701f27 by Berk Hess on 2017-12-04T08:27:10Z:
```
Only stop at nstlist steps with -reprod
Stopping mdrun with two INT or TERM signals would always happen right
after the first global communication step. But this breaks exact
continuation. Now with mdrun -reprod a second signal will still stop
at a pair-list generation step, like with the first signal, so we can
still have exact continuation.
Refs #2318
Change-Id: If65c1215d2509d60c1c5a6444769e7809288e798
```
* Revision 700d6f3870888d7f17d9c50bea477159369a3483 by Berk Hess on 2017-12-05T00:31:09Z:
```
Fix DD exact continuation bug
With domain decomposition the local atom density, used for setting
the search grid for sorting particles, was based on the local atom
count including atoms/charge groups that would be moved to
neighboring cells. This led to a different density value, which in turn
could result in a different number of search grid cells and thus
a different summation order during a run versus when continuing a run
from checkpoint, when no atoms would be moved. Now exact continuation
is guaranteed for the domdec module with the mdrun -reprod option.
Refs #2318
Change-Id: I78452c7dfcf3ca6bdee63ece3795efc7e4ac467f
```
* Uploads:
* [run.tpr](/uploads/56106ff2fc48f91611722b2bc97977b5/run.tpr)
* [short.tpr](/uploads/e8a607cb1fa5e0d144f3e7da8d1b311e/short.tpr)
* [short.log](/uploads/d3f3b2ff8b9a039d180819ccbc3c3439/short.log)
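
The second changeset above attributes the broken continuation to a different summation order. As a minimal illustration of why order alone is enough to break binary identity, the following sketch sums the same three values in two orders. It is a generic floating-point example, not GROMACS code, and the helper name `sum_in_order` is hypothetical.

```python
import math


def sum_in_order(values, order):
    """Accumulate values in the given index order using plain float addition."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values, [0, 1, 2])
backward = sum_in_order(values, [2, 1, 0])

# The two sums agree to within rounding but are not bit-identical,
# because floating-point addition is not associative.
print(forward.hex(), backward.hex())
print(forward == backward, math.isclose(forward, backward))
```

A difference in the last bit like this one is harmless for a single sum, but a binary comparison of the outputs of a full run and a continued run will flag it.
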