Highly parallel GROMACS 2022 simulations quickly crash
**Summary**

Using the benchMEM, benchRIB, or benchPEP input files from https://www.mpinat.mpg.de/grubmueller/bench with GROMACS 2022 in highly parallel scenarios (e.g. on two or more nodes with 96 cores each) leads to crashes within the first ~400 to 50,000 time steps. The crashes produce `step*.pdb` output files and error messages of the following kind:

- step 400: One or more water molecules can not be settled.
- Atom 73864 moved more than the distance allowed by the domain decomposition (0.825614) in direction Y

The same .tpr files run fine on identical hardware when GROMACS 2020 or 2021 is used.

**GROMACS version**

2022, 2022.1

**Steps to reproduce**

In my hands, the bug shows up with any of the following input files:

- https://www.mpinat.mpg.de/benchMEM
- https://www.mpinat.mpg.de/benchRIB
- https://www.mpinat.mpg.de/benchPEP

With v2021 all runs are fine, whereas with v2022 some of the runs abort, and the probability of a crash grows with the number of MPI ranks. I am currently running on 96-core AMD nodes, using various combinations with `N_MPI x N_OpenMP = 96 * N_nodes`:

```
 1 node  -> 48 simulations out of a total of 48 run fine
 2 nodes -> only 43 out of 48 run fine, rest aborts
 4 nodes -> only 32 of 48 run fine
 8 nodes -> only 19 of 48 run fine
64 nodes -> only  6 of 32 run fine
```

The behavior can be reproduced with both gcc-7.3.1 (+zen) and gcc-10.3.1 (+zen3) optimizations enabled.
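To make the trend easier to see, the counts from the table above can be turned into failure fractions. This is only a small sketch over the numbers reported here; the names `results` and `failure_fractions` are made up for illustration.

```python
# (nodes, runs that finished fine, runs started), copied from the table above
results = [(1, 48, 48), (2, 43, 48), (4, 32, 48), (8, 19, 48), (64, 6, 32)]


def failure_fractions(rows):
    """Map node count -> fraction of runs that aborted."""
    return {nodes: (total - ok) / total for nodes, ok, total in rows}


for nodes, frac in failure_fractions(results).items():
    print(f"{nodes:>3} nodes: {frac:6.1%} of runs aborted")
```

The fraction of aborted runs rises monotonically with the node count, from none on a single node to 26 of 32 on 64 nodes.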
```
GROMACS:      gmx mdrun, version 2022.1-spack
Executable:   /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv/bin/gmx_mpi
Data prefix:  /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv
Working dir:  /fsx/aws_benchmark/bench_scaling/hpc6a/Gromacs-2022.1-gcc7.3.1-zen/n04
Process ID:   8862
Command line:
  gmx_mpi mdrun -npme 0 -pin on -s /home/ec2-user/tpr/benchMEM.tpr -nsteps 50000 -resethway -noconfout -cpt 1440 -deffnm benchMEM.nodes4.ranks48.threads2.run01 -dlb no -ntomp 2

GROMACS version:    2022.1-spack
Precision:          mixed
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:        disabled
SIMD instructions:  AVX2_128
CPU FFT library:    fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library:    none
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-2.6.0
Tracing support:    disabled
C compiler:         /fsx/spack/lib/spack/env/gcc/gcc GNU 7.3.1
C compiler flags:   -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O2 -g -DNDEBUG
C++ compiler:       /fsx/spack/lib/spack/env/gcc/g++ GNU 7.3.1
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O2 -g -DNDEBUG

Running on 4 nodes with total 384 cores, 384 processing units
  Cores per node: 96
  Logical processing units per node: -96 - 96
  OS CPU Limit / recommended threads to start per node: 96
Hardware detected on host hpc6a-dy-hpc6a-48xlarge-23 (the node of MPI rank 0):
  CPU info:
    Vendor: AMD
    Brand: AMD EPYC 7R13 Processor
    Family: 25 Model: 1 Stepping: 1
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3 x2apic
  Hardware topology: Full, with devices
    Packages, cores, and logical processors:
    [indices refer to OS logical processors]
      Package 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39] [ 40] [ 41] [ 42] [ 43] [ 44] [ 45] [ 46] [ 47]
      Package 1: [ 48] [ 49] [ 50] [ 51] [ 52] [ 53] [ 54] [ 55] [ 56] [ 57] [ 58] [ 59] [ 60] [ 61] [ 62] [ 63] [ 64] [ 65] [ 66] [ 67] [ 68] [ 69] [ 70] [ 71] [ 72] [ 73] [ 74] [ 75] [ 76] [ 77] [ 78] [ 79] [ 80] [ 81] [ 82] [ 83] [ 84] [ 85] [ 86] [ 87] [ 88] [ 89] [ 90] [ 91] [ 92] [ 93] [ 94] [ 95]
    CPU limit set by OS: -1 Recommended max number of threads: 96
    Numa nodes:
      Node 0 (99241918464 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
      Node 1 (99332898816 bytes mem): 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      Node 2 (99332894720 bytes mem): 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
      Node 3 (99331854336 bytes mem): 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      Latency:
             0     1     2     3
        0  1.00  1.20  3.20  3.20
        1  1.20  1.00  3.20  3.20
        2  3.20  3.20  1.00  1.20
        3  3.20  3.20  1.20  1.00
    Caches:
      L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L2: 524288 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L3: 33554432 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
    PCI devices:
      0000:00:01.3 Id: 8086:7113 Class: 0x0000 Numa: -1
      0000:00:03.0 Id: 1d0f:1111 Class: 0x0300 Numa: -1
      0000:00:04.0 Id: 1d0f:8061 Class: 0x0108 Numa: -1
      0000:00:05.0 Id: 1d0f:ec20 Class: 0x0200 Numa: -1
      0000:00:06.0 Id: 1d0f:efa1 Class: 0x0200 Numa: -1

Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX2_128
Compiled SIMD newer than supported; program might crash.
```
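For anyone trying to reproduce this, the sketch below enumerates rank/thread splits that fill one 96-core node and rebuilds the `mdrun` command line shown in the log above. The flags and the `-deffnm` naming pattern are taken from that log. The helper names, the `max_threads` cutoff, and the assumption that the rank count in the file name is per node (48 ranks x 2 threads = 96 cores) are mine. The MPI launcher (`mpirun`, `srun`, ...) does not appear in the log and is therefore left out.

```python
import shlex

CORES_PER_NODE = 96  # from the log: "Cores per node: 96"


def rank_thread_splits(cores_per_node=CORES_PER_NODE, max_threads=8):
    """Return (ranks_per_node, omp_threads) pairs that exactly fill one node.

    max_threads is an arbitrary cutoff chosen for illustration only.
    """
    return [(cores_per_node // t, t)
            for t in range(1, max_threads + 1)
            if cores_per_node % t == 0]


def mdrun_command(tpr, bench, nodes, ranks_per_node, threads, run=1):
    """Build the mdrun command line with the flags seen in the log."""
    deffnm = (f"{bench}.nodes{nodes}.ranks{ranks_per_node}"
              f".threads{threads}.run{run:02d}")
    args = ["gmx_mpi", "mdrun", "-npme", "0", "-pin", "on", "-s", tpr,
            "-nsteps", "50000", "-resethway", "-noconfout", "-cpt", "1440",
            "-deffnm", deffnm, "-dlb", "no", "-ntomp", str(threads)]
    return shlex.join(args)


# Print one command per split for the 4-node benchMEM case from the log.
for ranks, threads in rank_thread_splits():
    print(mdrun_command("/home/ec2-user/tpr/benchMEM.tpr", "benchMEM",
                        4, ranks, threads))
```

Each printed line still needs to be prefixed with the site-specific MPI launcher and rank count before it can be submitted.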