Exact checkpoint restarts are not exact in 2018 beta1 - Redmine #2318
Archive from user: Erik Lindahl

Commit 76920a4f breaks binary-exact restarts, which in turn prevents debugging of other patches. I presume this is because the dynamic pruning keeps state across checkpointing that is not saved.

To reproduce, make two copies of the attached tpr, use the -reprod flag, run one of them to completion (2000 steps), and interrupt and continue the other one. Prior to 76920a4f, gmx check will confirm that the contents are identical, but not after.

There are three obvious options:

1) Reset the pruning so no state is kept at checkpoint steps
2) Save the state to the checkpoint
3) Disable dynamic pruning on CPUs

Hopefully (1) should not be too difficult, since I assume (2) is expensive.

However, it is quite embarrassing that we have now broken exact restarts in every single beta of the last ~3 versions. This really shows the importance of testing checkpoint restarts for binary identity **ANY** time we do anything related to either state or checkpoint I/O.

*(from redmine: issue id 2318, created on 2017-12-01 by gmxdefault, closed on 2017-12-05)*

* Changesets:
  * Revision 50a7265fdeb486151fb4401049bb4e4f86701f27 by Berk Hess on 2017-12-04T08:27:10Z:

    ```
    Only stop at nstlist steps with -reprod

    Stopping mdrun with two INT or TERM signals would always happen right after the first global communication step. But this breaks exact continuation. Now with mdrun -reprod a second signal will still stop at a pair-list generation step, like with the first signal, so we can still have exact continuation.

    Refs #2318

    Change-Id: If65c1215d2509d60c1c5a6444769e7809288e798
    ```

  * Revision 700d6f3870888d7f17d9c50bea477159369a3483 by Berk Hess on 2017-12-05T00:31:09Z:

    ```
    Fix DD exact continuation bug

    With domain decomposition the local atom density, used for setting the search grid for sorting particles, was based on the local atom count including atoms/charge groups that would be moved to neighboring cells. This led to a different density value, which in turn could result in a different number of search grid cells and thus a different summation order during a run versus when continuing a run from checkpoint, when no atoms would be moved. Now exact continuation is guaranteed for the domdec module with the mdrun -reprod option.

    Refs #2318

    Change-Id: I78452c7dfcf3ca6bdee63ece3795efc7e4ac467f
    ```

* Uploads:
  * [run.tpr](/uploads/56106ff2fc48f91611722b2bc97977b5/run.tpr)
  * [short.tpr](/uploads/e8a607cb1fa5e0d144f3e7da8d1b311e/short.tpr)
  * [short.log](/uploads/d3f3b2ff8b9a039d180819ccbc3c3439/short.log)
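
The failure mode discussed in this issue can be illustrated with a toy model. The sketch below is not GROMACS code: `NSTLIST`, `run_steps`, `continued_run`, and the update rule are hypothetical stand-ins chosen only for illustration. It assumes a piece of hidden state (`cache`, standing in for something like a pruned pair list) that is rebuilt only every `NSTLIST` steps, and compares a continuous run against runs that are stopped and continued from a checkpoint.

```python
NSTLIST = 10  # hypothetical rebuild interval for the hidden state (toy value)


def run_steps(x, start, stop, cache=None):
    """Advance the toy state x over steps [start, stop).

    `cache` is hidden state derived from x. It is rebuilt only on steps
    divisible by NSTLIST, or when no cache is available (fresh restart).
    """
    for step in range(start, stop):
        if cache is None or step % NSTLIST == 0:
            cache = 0.5 * x
        x = x + 0.1 * cache + 0.01
    return x, cache


def continued_run(stop_step, total_steps, save_cache=False):
    """Run to stop_step, write a toy checkpoint, then continue from it."""
    x, cache = run_steps(1.0, 0, stop_step)
    checkpoint = (x, cache if save_cache else None)
    x, _ = run_steps(checkpoint[0], stop_step, total_steps, checkpoint[1])
    return x


reference, _ = run_steps(1.0, 0, 40)

# float.hex() is an exact text form of the value, so equal strings
# mean the two results are bit-for-bit identical.
at_rebuild = continued_run(20, 40).hex() == reference.hex()
off_rebuild = continued_run(25, 40).hex() == reference.hex()
with_cache = continued_run(25, 40, save_cache=True).hex() == reference.hex()

print("stop on a rebuild step, cache not saved:", at_rebuild)
print("stop off a rebuild step, cache not saved:", off_rebuild)
print("stop off a rebuild step, cache saved:", with_cache)
```

In this toy, stopping on a rebuild step corresponds loosely to option (1) and to only stopping at nstlist steps with -reprod, while saving the cache corresponds to option (2). Stopping between rebuild steps without saving the cache is the only case where the continued run diverges from the reference.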