Highly parallel GROMACS 2022 simulations quickly crash
**Summary**
Using the benchMEM, benchRIB, or benchPEP input files from https://www.mpinat.mpg.de/grubmueller/bench with GROMACS 2022 in highly parallel runs (e.g. on two or more nodes with 96 cores each) leads to crashes within the first ~400 to 50,000 time steps. The crashes produce step*.pdb output files and error messages such as
- step 400: One or more water molecules can not be settled.
- Atom 73864 moved more than the distance allowed by the domain decomposition (0.825614) in direction Y
The same .tpr files run fine on identical hardware with GROMACS 2020 or 2021.
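To sort a large batch of runs by failure type, the two messages quoted above can be matched against the captured mdrun output. This is only a minimal sketch: the function name `classify_output` and the labels it returns are hypothetical, and it assumes the messages appear verbatim in whatever text (log or stderr) is passed in. A result of `"none"` only means that neither message was found, not that the run completed.

```python
import re

# Patterns built from the two error messages quoted in the summary.
SETTLE_RE = re.compile(
    r"step (\d+): One or more water molecules can not be settled"
)
DD_RE = re.compile(
    r"Atom (\d+) moved more than the distance allowed by the domain decomposition"
)


def classify_output(text: str) -> str:
    """Return 'settle', 'dd', or 'none' for the captured output of one run.

    Hypothetical helper: the labels are illustrative, not GROMACS terminology.
    """
    if SETTLE_RE.search(text):
        return "settle"
    if DD_RE.search(text):
        return "dd"
    return "none"
```

For example, `classify_output(open("run01.log").read())` would label a single run, where `run01.log` stands in for any captured output file.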
**GROMACS version**
2022
2022.1
**Steps to reproduce**
In my hands the bug shows up with any of the following input files:
https://www.mpinat.mpg.de/benchMEM
https://www.mpinat.mpg.de/benchRIB
https://www.mpinat.mpg.de/benchPEP
Whereas all runs are fine with v2021, some of the runs abort with v2022, and the probability of a crash increases with the number of MPI ranks. I am currently running on 96-core AMD nodes, using various combinations with N_MPI x N_OpenMP = 96 * N_nodes:
```
1 node -> 48 simulations out of a total of 48 run fine
2 nodes -> only 43 out of 48 run fine, the rest abort
4 nodes -> only 32 of 48 run fine
8 nodes -> only 19 of 48 run fine
64 nodes -> only 6 of 32 run fine
```
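Expressed as abort fractions, the counts above make the trend with node count easier to see. The short sketch below only restates the table; the tuple layout and the name `failure_fraction` are illustrative choices.

```python
# (nodes, runs that finished, runs started), copied from the table above.
results = [(1, 48, 48), (2, 43, 48), (4, 32, 48), (8, 19, 48), (64, 6, 32)]


def failure_fraction(ok: int, total: int) -> float:
    """Fraction of runs that aborted."""
    return (total - ok) / total


for nodes, ok, total in results:
    aborted = total - ok
    print(f"{nodes:3d} nodes: {aborted:2d}/{total} aborted "
          f"({failure_fraction(ok, total):.0%})")
```

This gives abort rates of roughly 0%, 10%, 33%, 60%, and 81% for 1, 2, 4, 8, and 64 nodes, respectively.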
The behavior can be reproduced with both gcc-7.3.1 (+zen optimizations) and gcc-10.3.1 (+zen3 optimizations).
```
GROMACS: gmx mdrun, version 2022.1-spack
Executable: /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv/bin/gmx_mpi
Data prefix: /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv
Working dir: /fsx/aws_benchmark/bench_scaling/hpc6a/Gromacs-2022.1-gcc7.3.1-zen/n04
Process ID: 8862
Command line:
gmx_mpi mdrun -npme 0 -pin on -s /home/ec2-user/tpr/benchMEM.tpr -nsteps 50000 -resethway -noconfout -cpt 1440 -deffnm benchMEM.nodes4.ranks48.threads2.run01 -dlb no -ntomp 2
GROMACS version: 2022.1-spack
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: disabled
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-2.6.0
Tracing support: disabled
C compiler: /fsx/spack/lib/spack/env/gcc/gcc GNU 7.3.1
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O2 -g -DNDEBUG
C++ compiler: /fsx/spack/lib/spack/env/gcc/g++ GNU 7.3.1
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O2 -g -DNDEBUG
Running on 4 nodes with total 384 cores, 384 processing units
Cores per node: 96
Logical processing units per node: 96
OS CPU Limit / recommended threads to start per node: 96
Hardware detected on host hpc6a-dy-hpc6a-48xlarge-23 (the node of MPI rank 0):
CPU info:
Vendor: AMD
Brand: AMD EPYC 7R13 Processor
Family: 25 Model: 1 Stepping: 1
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3 x2apic
Hardware topology: Full, with devices
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39] [ 40] [ 41] [ 42] [ 43] [ 44] [ 45] [ 46] [ 47]
Package 1: [ 48] [ 49] [ 50] [ 51] [ 52] [ 53] [ 54] [ 55] [ 56] [ 57] [ 58] [ 59] [ 60] [ 61] [ 62] [ 63] [ 64] [ 65] [ 66] [ 67] [ 68] [ 69] [ 70] [ 71] [ 72] [ 73] [ 74] [ 75] [ 76] [ 77] [ 78] [ 79] [ 80] [ 81] [ 82] [ 83] [ 84] [ 85] [ 86] [ 87] [ 88] [ 89] [ 90] [ 91] [ 92] [ 93] [ 94] [ 95]
CPU limit set by OS: -1 Recommended max number of threads: 96
Numa nodes:
Node 0 (99241918464 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Node 1 (99332898816 bytes mem): 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Node 2 (99332894720 bytes mem): 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
Node 3 (99331854336 bytes mem): 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
Latency:
0 1 2 3
0 1.00 1.20 3.20 3.20
1 1.20 1.00 3.20 3.20
2 3.20 3.20 1.00 1.20
3 3.20 3.20 1.20 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 524288 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L3: 33554432 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:01.3 Id: 8086:7113 Class: 0x0000 Numa: -1
0000:00:03.0 Id: 1d0f:1111 Class: 0x0300 Numa: -1
0000:00:04.0 Id: 1d0f:8061 Class: 0x0108 Numa: -1
0000:00:05.0 Id: 1d0f:ec20 Class: 0x0200 Numa: -1
0000:00:06.0 Id: 1d0f:efa1 Class: 0x0200 Numa: -1
Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX2_128
Compiled SIMD newer than supported; program might crash.
```
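For anyone trying to reproduce this, the sketch below enumerates rank and thread combinations that satisfy N_MPI x N_OpenMP = 96 * N_nodes and builds an mdrun argument list modeled on the command line in the log above. Several things here are assumptions: the report does not say which combinations were actually used, so the restriction to thread counts that divide 96 is a guess; the function names are hypothetical; and the MPI launcher (e.g. `mpirun` or `srun`) is left out because it is not shown in the log. The `-deffnm` pattern follows the logged name, where `ranks48` appears to be the rank count per node.

```python
import os

CORES_PER_NODE = 96  # from the hardware report above


def decompositions(n_nodes, cores_per_node=CORES_PER_NODE):
    """Yield (n_mpi, n_openmp) with n_mpi * n_openmp == cores_per_node * n_nodes.

    Assumption: only OpenMP thread counts that divide the cores per node
    are considered, so the threads of one rank never span two nodes.
    """
    total = cores_per_node * n_nodes
    for n_omp in range(1, cores_per_node + 1):
        if cores_per_node % n_omp == 0:
            yield total // n_omp, n_omp


def mdrun_args(tpr, n_nodes, n_omp, run, nsteps=50000,
               cores_per_node=CORES_PER_NODE):
    """Build the gmx_mpi mdrun argument list, using only flags from the log."""
    system = os.path.splitext(os.path.basename(tpr))[0]
    ranks_per_node = cores_per_node // n_omp
    deffnm = (f"{system}.nodes{n_nodes}.ranks{ranks_per_node}"
              f".threads{n_omp}.run{run:02d}")
    return ["gmx_mpi", "mdrun", "-npme", "0", "-pin", "on", "-s", tpr,
            "-nsteps", str(nsteps), "-resethway", "-noconfout",
            "-cpt", "1440", "-deffnm", deffnm, "-dlb", "no",
            "-ntomp", str(n_omp)]


if __name__ == "__main__":
    for n_mpi, n_omp in decompositions(4):
        print(n_mpi, "ranks x", n_omp, "threads")
    print(" ".join(mdrun_args("/home/ec2-user/tpr/benchMEM.tpr", 4, 2, 1)))
```

With 4 nodes and 2 threads per rank, this yields 192 ranks and the same argument list as the logged command; the list still has to be prefixed with the launcher appropriate for the cluster.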