Hexagonal PBC and MPI - Redmine #2125
Archive from user: Bart Bruininks

Hey GROMACS people,

I was recently trying to increase the efficiency of my membrane-particle fusion box by changing the PBC to be more like a hexagon (10 10 10 0 0 5 0 10 5). I know this is not a perfect regular hexagon, since the box aspect ratio should not be square but something like 10\*3^0.5/2; but hey, I guess it should also work.

When I create a box with these dimensions, it runs perfectly on a hyperthreaded 6-core machine. However, when I move to multiple nodes and start using MPI with GROMACS 2016, things go severely wrong in fewer than 100 steps. I tried different versions of GROMACS (5.1.1, 5.1.4 and 2016.1), but the issue was always the same. I can't say with 100% certainty that MPI is indeed the cause, but whenever I ask for more cores than one node can provide, the issue appears.

I'm not a hardcore programmer and wouldn't be able to fix the issue or pinpoint exactly what goes wrong, but I would like to point out the problem. I will attach the md.tpr file I run (it is a MARTINI system, but that should not matter much; running with -rdd 2.0 might be necessary, though). Though possibly a small bug, I think it is worth solving, since these hexagonal boxes are very nice for any particle migrating into a membrane.

Cheers, and hopefully it can be resolved,
Bart

*(from redmine: issue id 2125, created on 2017-02-16 by gmxdefault, closed on 2018-01-09)*

* Changesets:
  * Revision b1a0f28eb503c5e7974dc8c998797cb71c3f0b42 by Berk Hess on 2018-01-08T18:37:34Z:
    ```
    Fix triclinic domain decomposition bug

    With triclinic unit-cells with vectors a,b,c, the domain decomposition would communicate an incorrect halo along dimension x when b[x]!=0 and vector c not parallel to the z-axis. The halo cut-off bound plane was tilted incorrect along x/z with an error approximately proportional to b[x]*(c[x] - b[x]*c[y]/b[y]).
    When c[x] > b[x]*c[y]/b[y], the communicated halo was too small, which could cause instabilities or silent errors.
    When c[x] < b[x]*c[y]/b[y], the communicated halo was too large, which could cause some communication overhead.

    Fixes #2125

    Change-Id: I2109542292beca5be26eddc262e0974c4ae825ea
    ```
  * Revision 3a33815834794d83af60b0d16a5ef22015c4dfdf by Berk Hess on 2018-01-09T07:45:26Z:
    ```
    Fix triclinic domain decomposition bug

    With triclinic unit-cells with vectors a,b,c, the domain decomposition would communicate an incorrect halo along dimension x when b[x]!=0 and vector c not parallel to the z-axis. The halo cut-off bound plane was tilted incorrect along x/z with an error approximately proportional to b[x]*(c[x] - b[x]*c[y]/b[y]).
    When c[x] > b[x]*c[y]/b[y], the communicated halo was too small, which could cause instabilities or silent errors.
    When c[x] < b[x]*c[y]/b[y], the communicated halo was too large, which could cause some communication overhead.

    Fixes #2125

    Change-Id: I2109542292beca5be26eddc262e0974c4ae825ea
    (cherry picked from commit b1a0f28eb503c5e7974dc8c998797cb71c3f0b42)
    ```
* Uploads:
  * [md.tpr](/uploads/c26dfc894cbc7050937e40bf39cb68c9/md.tpr) The tpr file that should reproduce the bug when used in combination with MPI
  * [md-rdd-2.log](/uploads/33520beab3e59d3bca07ef0a8feef73e/md-rdd-2.log)
  * [md-no-rdd.log](/uploads/bb13202cbe0a0c3fc0e37c7d8678a57d/md-no-rdd.log)
  * [md-single-rank.log](/uploads/216ef5e8be262b38b9d603fca8831086/md-single-rank.log)
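The condition in the fix's commit message can be checked for a given box with a short script. This is a minimal sketch, assuming the GROMACS lower-triangular box convention (a = (a_x, 0, 0), b = (b_x, b_y, 0), c = (c_x, c_y, c_z)). The helper names `halo_error_term` and `classify_box` are hypothetical and not part of GROMACS, and the classification only restates the commit message; it does not reproduce the actual domain decomposition code. Apart from the regular hexagonal prism, the example boxes are made up for illustration.

```python
import math
from typing import Sequence

# Hypothetical helpers. Box vectors follow the GROMACS lower-triangular
# convention: a = (ax, 0, 0), b = (bx, by, 0), c = (cx, cy, cz).
# Only b and c enter the criterion quoted in the commit message.
Vec3 = Sequence[float]


def halo_error_term(b: Vec3, c: Vec3) -> float:
    """Return c[x] - b[x]*c[y]/b[y], the factor quoted in the commit message.

    Per the commit, the tilt error of the x-halo bound plane was roughly
    proportional to b[x] times this value.
    """
    if b[1] == 0:
        raise ValueError("b[y] must be non-zero in a valid triclinic box")
    return c[0] - b[0] * c[1] / b[1]


def classify_box(b: Vec3, c: Vec3) -> str:
    """Classify the pre-fix x-halo behaviour, following the commit wording."""
    # The bug requires b[x] != 0 and c not parallel to the z-axis.
    if b[0] == 0 or (c[0] == 0 and c[1] == 0):
        return "unaffected"
    delta = halo_error_term(b, c)
    if delta > 0:
        return "halo too small"  # could cause instabilities or silent errors
    if delta < 0:
        return "halo too large"  # only extra communication overhead
    return "unaffected"  # error term vanishes exactly


if __name__ == "__main__":
    side = 10.0
    # Regular hexagonal prism: b[y] = side*sqrt(3)/2 and c along z.
    print(classify_box((side / 2, side * math.sqrt(3) / 2, 0.0), (0.0, 0.0, 10.0)))  # unaffected
    # Hypothetical boxes with c tilted away from the z-axis.
    print(classify_box((5.0, 10.0, 0.0), (5.0, 5.0, 10.0)))  # halo too small
    print(classify_box((5.0, 10.0, 0.0), (0.0, 5.0, 10.0)))  # halo too large
```

A box with c along z, such as a regular hexagonal prism, gives a zero error term. According to the commit message, only boxes with b[x] != 0 and a c vector tilted away from the z-axis were affected.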