Highly parallel GROMACS 2022 simulations quickly crash
Summary
Using the benchMEM, benchRIB, or benchPEP input files from https://www.mpinat.mpg.de/grubmueller/bench with GROMACS 2022 in highly parallel scenarios (e.g. on two or more nodes with 96 cores each) leads to crashes within the first ~400 to 50,000 time steps. The crashes produce step*.pdb output files and error messages such as
- step 400: One or more water molecules can not be settled.
- Atom 73864 moved more than the distance allowed by the domain decomposition (0.825614) in direction Y
The same .tpr files run fine on identical hardware with GROMACS 2020 or 2021.
GROMACS version
2022, 2022.1
Steps to reproduce
In my hands the bug shows up with any of the following input files: https://www.mpinat.mpg.de/benchMEM, https://www.mpinat.mpg.de/benchRIB, https://www.mpinat.mpg.de/benchPEP
Whereas with v2021 all runs finish fine, with v2022 some of the runs abort, and the probability of a crash grows with the number of MPI ranks. I am currently running on 96-core AMD nodes, using various decompositions N_MPI x N_OpenMP = 96 * N_nodes:
1 node -> all 48 of 48 runs finish fine
2 nodes -> only 43 of 48 run fine, the rest abort
4 nodes -> only 32 of 48 run fine
8 nodes -> only 19 of 48 run fine
64 nodes -> only 6 of 32 run fine
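The sweep above can be sketched as a small launcher script. This is a minimal sketch, not the script actually used: the srun launcher, the benchMEM.tpr path, and the gen_cmd helper are assumptions; the mdrun flags are copied from the command line quoted further down.

```shell
#!/bin/sh
# Hypothetical sweep launcher (a sketch; srun, the .tpr path, and the
# gen_cmd helper are assumptions, not the script actually used).
# It builds one mdrun command per replicate, mirroring the naming scheme
# benchMEM.nodes<N>.ranks<R>.threads<T>.run<XX> seen in the log output.

gen_cmd() {
  # gen_cmd NODES NTOMP RUN -> print the launch command for one replicate
  nodes=$1 ntomp=$2 run=$3
  rpn=$((96 / ntomp))            # MPI ranks per 96-core node
  nranks=$((nodes * rpn))        # total MPI ranks across all nodes
  echo "srun -N $nodes -n $nranks gmx_mpi mdrun -npme 0 -pin on" \
       "-s benchMEM.tpr -nsteps 50000 -resethway -noconfout -dlb no" \
       "-ntomp $ntomp -deffnm benchMEM.nodes$nodes.ranks$rpn.threads$ntomp.run$run"
}

# Print the 48 replicate commands for each node count (pipe to sh to submit):
for nodes in 1 2 4 8; do
  for run in $(seq -w 1 48); do
    gen_cmd "$nodes" 2 "$run"
  done
done
```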
The behavior can be reproduced with gcc-7.3.1 (+zen optimizations) and with gcc-10.3.1 (+zen3 optimizations).
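Since the builds come from Spack (see the executable path below), the two configurations can be requested roughly as follows. This is a sketch only; the exact variants and compiler entries in the site's Spack configuration are assumptions:

```shell
# Hypothetical Spack specs (sketch; your site's compilers/variants may differ):
# GROMACS 2022.1 with MPI, built by gcc 7.3.1 with zen optimizations:
spack install gromacs@2022.1 +mpi %gcc@7.3.1 target=zen
# Same, built by gcc 10.3.1 with zen3 optimizations:
spack install gromacs@2022.1 +mpi %gcc@10.3.1 target=zen3
```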
GROMACS: gmx mdrun, version 2022.1-spack
Executable: /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv/bin/gmx_mpi
Data prefix: /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv
Working dir: /fsx/aws_benchmark/bench_scaling/hpc6a/Gromacs-2022.1-gcc7.3.1-zen/n04
Process ID: 8862
Command line:
gmx_mpi mdrun -npme 0 -pin on -s /home/ec2-user/tpr/benchMEM.tpr -nsteps 50000 -resethway -noconfout -cpt 1440 -deffnm benchMEM.nodes4.ranks48.threads2.run01 -dlb no -ntomp 2
GROMACS version: 2022.1-spack
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: disabled
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-2.6.0
Tracing support: disabled
C compiler: /fsx/spack/lib/spack/env/gcc/gcc GNU 7.3.1
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O2 -g -DNDEBUG
C++ compiler: /fsx/spack/lib/spack/env/gcc/g++ GNU 7.3.1
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O2 -g -DNDEBUG
Running on 4 nodes with total 384 cores, 384 processing units
Cores per node: 96
Logical processing units per node: 96
OS CPU Limit / recommended threads to start per node: 96
Hardware detected on host hpc6a-dy-hpc6a-48xlarge-23 (the node of MPI rank 0):
CPU info:
Vendor: AMD
Brand: AMD EPYC 7R13 Processor
Family: 25 Model: 1 Stepping: 1
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3 x2apic
Hardware topology: Full, with devices
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39] [ 40] [ 41] [ 42] [ 43] [ 44] [ 45] [ 46] [ 47]
Package 1: [ 48] [ 49] [ 50] [ 51] [ 52] [ 53] [ 54] [ 55] [ 56] [ 57] [ 58] [ 59] [ 60] [ 61] [ 62] [ 63] [ 64] [ 65] [ 66] [ 67] [ 68] [ 69] [ 70] [ 71] [ 72] [ 73] [ 74] [ 75] [ 76] [ 77] [ 78] [ 79] [ 80] [ 81] [ 82] [ 83] [ 84] [ 85] [ 86] [ 87] [ 88] [ 89] [ 90] [ 91] [ 92] [ 93] [ 94] [ 95]
CPU limit set by OS: -1 Recommended max number of threads: 96
Numa nodes:
Node 0 (99241918464 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Node 1 (99332898816 bytes mem): 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Node 2 (99332894720 bytes mem): 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
Node 3 (99331854336 bytes mem): 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
Latency:
0 1 2 3
0 1.00 1.20 3.20 3.20
1 1.20 1.00 3.20 3.20
2 3.20 3.20 1.00 1.20
3 3.20 3.20 1.20 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 524288 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L3: 33554432 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:01.3 Id: 8086:7113 Class: 0x0000 Numa: -1
0000:00:03.0 Id: 1d0f:1111 Class: 0x0300 Numa: -1
0000:00:04.0 Id: 1d0f:8061 Class: 0x0108 Numa: -1
0000:00:05.0 Id: 1d0f:ec20 Class: 0x0200 Numa: -1
0000:00:06.0 Id: 1d0f:efa1 Class: 0x0200 Numa: -1
Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX2_128
Compiled SIMD newer than supported; program might crash.