Hexagonal PBC and MPI - Redmine #2125
Archive from user: Bart Bruininks
Hey GROMACS people,
I was recently trying to increase the efficiency of my membrane-particle fusion box by changing the PBC to be more like a hexagon (10 10 10 0 0 5 0 10 5). I know this is not a perfect regular hexagon, since the box ratio should not be square but something like 10*3^0.5/2, but I figured it should still work. When I create a box with these dimensions I can run perfectly on a hyperthreaded 6-core machine. However, when I move to multiple nodes and start using MPI with GROMACS 2016, things go severely wrong in fewer than 100 steps. I tried different versions of GROMACS (5.1.1, 5.1.4 and 2016.1), but the issue was always the same. I cannot say with 100% certainty that MPI is the cause, but the problem presents itself whenever I ask for more cores than one node can provide.
I am not much of a programmer and would not be able to solve the issue or pinpoint exactly what goes wrong, but I would like to report the problem. I will attach the md.tpr file I run (it is a MARTINI system, but that should not matter too much; running with -rdd 2.0 might be necessary though). Though possibly a small bug, I think it would be worth solving, since these hexagonal boxes are very convenient for any particle migrating into a membrane.
Cheers and hopefully it can be resolved,
Bart
(from redmine: issue id 2125, created on 2017-02-16 by gmxdefault, closed on 2018-01-09)
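For reference, the regular-hexagon-compatible box the poster alludes to (with the 10*3^0.5/2 ratio) can be expressed as GROMACS triclinic box vectors. The sketch below is illustrative, not the attached system: the function names are hypothetical, the dimensions are made up, and it assumes the standard GROMACS triclinic conditions (a along x, b in the xy-plane, off-diagonal elements at most half the corresponding diagonal ones).

```python
import math

def hexagonal_prism_box(d, h):
    """Triclinic box vectors (a, b, c) for a hexagon-like prism of
    width d and height h, in the GROMACS convention
    a = (ax, 0, 0), b = (bx, by, 0), c = (cx, cy, cz).
    Names and dimensions are illustrative, not from the attached tpr."""
    a = (d, 0.0, 0.0)
    b = (d / 2.0, d * math.sqrt(3.0) / 2.0, 0.0)  # the 10*3^0.5/2 ratio
    c = (0.0, 0.0, h)
    return a, b, c

def satisfies_gromacs_box_conditions(box):
    """Check the GROMACS triclinic box conditions: positive diagonal,
    b and c restricted so off-diagonal elements are at most half the
    corresponding diagonal element."""
    a, b, c = box
    return (a[1] == a[2] == b[2] == 0.0
            and a[0] > 0 and b[1] > 0 and c[2] > 0
            and abs(b[0]) <= 0.5 * a[0]
            and abs(c[0]) <= 0.5 * a[0]
            and abs(c[1]) <= 0.5 * b[1])

box = hexagonal_prism_box(10.0, 10.0)
print(satisfies_gromacs_box_conditions(box))  # True
```

With d = 10 this gives b = (5, 10*sqrt(3)/2, 0), i.e. the "not quite square" ratio the poster mentions for a regular hexagonal cross-section.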
- Changesets:
- Revision b1a0f28e by Berk Hess on 2018-01-08T18:37:34Z:
Fix triclinic domain decomposition bug
With triclinic unit-cells with vectors a,b,c, the domain decomposition
would communicate an incorrect halo along dimension x when b[x]!=0
and vector c not parallel to the z-axis. The halo cut-off bound plane
was tilted incorrectly along x/z, with an error approximately
proportional to b[x]*(c[x] - b[x]*c[y]/b[y]).
When c[x] > b[x]*c[y]/b[y], the communicated halo was too small, which
could cause instabilities or silent errors.
When c[x] < b[x]*c[y]/b[y], the communicated halo was too large, which
could cause some communication overhead.
Fixes #2125
Change-Id: I2109542292beca5be26eddc262e0974c4ae825ea
- Revision 3a338158 by Berk Hess on 2018-01-09T07:45:26Z:
Fix triclinic domain decomposition bug
With triclinic unit-cells with vectors a,b,c, the domain decomposition
would communicate an incorrect halo along dimension x when b[x]!=0
and vector c not parallel to the z-axis. The halo cut-off bound plane
was tilted incorrectly along x/z, with an error approximately
proportional to b[x]*(c[x] - b[x]*c[y]/b[y]).
When c[x] > b[x]*c[y]/b[y], the communicated halo was too small, which
could cause instabilities or silent errors.
When c[x] < b[x]*c[y]/b[y], the communicated halo was too large, which
could cause some communication overhead.
Fixes #2125
Change-Id: I2109542292beca5be26eddc262e0974c4ae825ea
(cherry picked from commit b1a0f28eb503c5e7974dc8c998797cb71c3f0b42)
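The error term named in the commit messages can be sketched numerically. This is a minimal illustration of that formula only, with made-up box values, assuming the GROMACS convention a = (ax, 0, 0), b = (bx, by, 0), c = (cx, cy, cz):

```python
def halo_tilt_error(box):
    """Approximate tilt error of the x/z halo cut-off plane, per the
    commit message: proportional to b[x] * (c[x] - b[x]*c[y]/b[y]).
    box is a tuple of triclinic vectors (a, b, c)."""
    a, b, c = box
    return b[0] * (c[0] - b[0] * c[1] / b[1])

# Rectangular box: b[x] == 0, so the halo was always correct.
rect = ((10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0))

# Tilted box with b[x] != 0 and c not parallel to z: positive error,
# i.e. c[x] > b[x]*c[y]/b[y], the too-small-halo (dangerous) case.
tilted = ((10.0, 0.0, 0.0), (5.0, 10.0, 0.0), (5.0, 2.0, 10.0))

print(halo_tilt_error(rect))    # 0.0
print(halo_tilt_error(tilted))  # 20.0
```

This matches the conditions in the report: the bug only bites when b[x] != 0 and c is not parallel to the z-axis, which is exactly the hexagonal-style box the poster used.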
- Uploads:
- md.tpr The run input (tpr) which should reproduce the bug when used in combination with MPI
- md-rdd-2.log
- md-no-rdd.log
- md-single-rank.log