Highly parallel GROMACS 2022 simulations quickly crash
Summary
Using the benchMEM, benchRIB, or benchPEP input files from https://www.mpinat.mpg.de/grubmueller/bench with GROMACS 2022 in highly parallel scenarios (e.g. on two or more nodes with 96 cores each) leads to crashes within the first ~400 to 50,000 time steps. The crashes produce step*.pdb output files and error messages such as
- step 400: One or more water molecules can not be settled.
- Atom 73864 moved more than the distance allowed by the domain decomposition (0.825614) in direction Y
The same .tpr files run fine on identical hardware with GROMACS 2020 or 2021.
GROMACS version
2022, 2022.1
Steps to reproduce
In my hands the bug shows up with any of the following input files: https://www.mpinat.mpg.de/benchMEM, https://www.mpinat.mpg.de/benchRIB, https://www.mpinat.mpg.de/benchPEP
Whereas with v2021 all runs finish fine, with v2022 some of the runs abort, and the probability of a crash grows with the number of MPI ranks. I am currently running on 96-core AMD nodes, using various decompositions N_MPI x N_OpenMP = 96 * N_nodes:
1 node -> all 48 of 48 runs finish fine
2 nodes -> only 43 of 48 run fine, the rest abort
4 nodes -> only 32 of 48 run fine
8 nodes -> only 19 of 48 run fine
64 nodes -> only 6 of 32 run fine
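The sweep above can be sketched as a small launcher script. This is a minimal sketch, not the script actually used: the srun launcher, the benchMEM.tpr path, and the gen_cmd helper are assumptions; the mdrun flags are copied from the command line quoted further down.

```shell
#!/bin/sh
# Hypothetical sweep launcher (a sketch; srun, the .tpr path, and the
# gen_cmd helper are assumptions, not the script actually used).
# It builds one mdrun command per replicate, mirroring the naming scheme
# benchMEM.nodes<N>.ranks<R>.threads<T>.run<XX> seen in the log output.

gen_cmd() {
  # gen_cmd NODES NTOMP RUN -> print the launch command for one replicate
  nodes=$1 ntomp=$2 run=$3
  rpn=$((96 / ntomp))            # MPI ranks per 96-core node
  nranks=$((nodes * rpn))        # total MPI ranks across all nodes
  echo "srun -N $nodes -n $nranks gmx_mpi mdrun -npme 0 -pin on" \
       "-s benchMEM.tpr -nsteps 50000 -resethway -noconfout -dlb no" \
       "-ntomp $ntomp -deffnm benchMEM.nodes$nodes.ranks$rpn.threads$ntomp.run$run"
}

# Print the 48 replicate commands for each node count (pipe to sh to submit):
for nodes in 1 2 4 8; do
  for run in $(seq -w 1 48); do
    gen_cmd "$nodes" 2 "$run"
  done
done
```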
The behavior can be reproduced with gcc-7.3.1 (+zen optimizations) and with gcc-10.3.1 (+zen3 optimizations).
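Since the builds come from Spack (see the executable path below), the two configurations can be requested roughly as follows. This is a sketch only; the exact variants and compiler entries in the site's Spack configuration are assumptions:

```shell
# Hypothetical Spack specs (sketch; your site's compilers/variants may differ):
# GROMACS 2022.1 with MPI, built by gcc 7.3.1 with zen optimizations:
spack install gromacs@2022.1 +mpi %gcc@7.3.1 target=zen
# Same, built by gcc 10.3.1 with zen3 optimizations:
spack install gromacs@2022.1 +mpi %gcc@10.3.1 target=zen3
```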
GROMACS: gmx mdrun, version 2022.1-spack
Executable: /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv/bin/gmx_mpi
Data prefix: /fsx/spack/opt/spack/linux-amzn2-zen/gcc-7.3.1/gromacs-2022.1-tinuujqkxqqivlzwjxioqacrgakfu2jv
Working dir: /fsx/aws_benchmark/bench_scaling/hpc6a/Gromacs-2022.1-gcc7.3.1-zen/n04
Process ID: 8862
Command line:
gmx_mpi mdrun -npme 0 -pin on -s /home/ec2-user/tpr/benchMEM.tpr -nsteps 50000 -resethway -noconfout -cpt 1440 -deffnm benchMEM.nodes4.ranks48.threads2.run01 -dlb no -ntomp 2
GROMACS version: 2022.1-spack
Precision: mixed
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: disabled
SIMD instructions: AVX2_128
CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-2.6.0
Tracing support: disabled
C compiler: /fsx/spack/lib/spack/env/gcc/gcc GNU 7.3.1
C compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O2 -g -DNDEBUG
C++ compiler: /fsx/spack/lib/spack/env/gcc/g++ GNU 7.3.1
C++ compiler flags: -mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O2 -g -DNDEBUG
Running on 4 nodes with total 384 cores, 384 processing units
Cores per node: 96
Logical processing units per node: 96
OS CPU Limit / recommended threads to start per node: 96
Hardware detected on host hpc6a-dy-hpc6a-48xlarge-23 (the node of MPI rank 0):
CPU info:
Vendor: AMD
Brand: AMD EPYC 7R13 Processor
Family: 25 Model: 1 Stepping: 1
Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3 x2apic
Hardware topology: Full, with devices
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11] [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23] [ 24] [ 25] [ 26] [ 27] [ 28] [ 29] [ 30] [ 31] [ 32] [ 33] [ 34] [ 35] [ 36] [ 37] [ 38] [ 39] [ 40] [ 41] [ 42] [ 43] [ 44] [ 45] [ 46] [ 47]
Package 1: [ 48] [ 49] [ 50] [ 51] [ 52] [ 53] [ 54] [ 55] [ 56] [ 57] [ 58] [ 59] [ 60] [ 61] [ 62] [ 63] [ 64] [ 65] [ 66] [ 67] [ 68] [ 69] [ 70] [ 71] [ 72] [ 73] [ 74] [ 75] [ 76] [ 77] [ 78] [ 79] [ 80] [ 81] [ 82] [ 83] [ 84] [ 85] [ 86] [ 87] [ 88] [ 89] [ 90] [ 91] [ 92] [ 93] [ 94] [ 95]
CPU limit set by OS: -1 Recommended max number of threads: 96
Numa nodes:
Node 0 (99241918464 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Node 1 (99332898816 bytes mem): 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Node 2 (99332894720 bytes mem): 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
Node 3 (99331854336 bytes mem): 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
Latency:
0 1 2 3
0 1.00 1.20 3.20 3.20
1 1.20 1.00 3.20 3.20
2 3.20 3.20 1.00 1.20
3 3.20 3.20 1.20 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 524288 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L3: 33554432 bytes, linesize 64 bytes, assoc. 16, shared 8 ways
PCI devices:
0000:00:01.3 Id: 8086:7113 Class: 0x0000 Numa: -1
0000:00:03.0 Id: 1d0f:1111 Class: 0x0300 Numa: -1
0000:00:04.0 Id: 1d0f:8061 Class: 0x0108 Numa: -1
0000:00:05.0 Id: 1d0f:ec20 Class: 0x0200 Numa: -1
0000:00:06.0 Id: 1d0f:efa1 Class: 0x0200 Numa: -1
Highest SIMD level supported by all nodes in run: AVX2_256
SIMD instructions selected at compile time: AVX2_128
Compiled SIMD newer than supported; program might crash.