Open MPI applications hang and/or crash occasionally on `aarch64/neoverse_v1`
Input fields:
Institution (optional): Amazon
EESSI Version (required): 2023.06: foss/2022a and perhaps foss/2023a
Machine architecture (required): neoverse_v1 (c7g instance types)
Operating system (required): Rocky 9 and also Ubuntu 20.04 + EESSI Pilot Apptainer
Software affected (optional): OpenFOAM, FFTW, ESPResSo
Each of the software packages above has experienced intermittent test failures, including hangs, application validation failures, and segfaults. The common thread is that they all use Open MPI 4.1.5 or 4.1.4, and the problem is unique to the neoverse_v1 builds.
I'm opening this issue to try to get to the root cause, which is likely in the Open MPI build.
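Because the failures are intermittent, a single passing run says little. Below is a minimal sketch of a harness that reruns one test command many times and counts hangs, crashes, and failures. It is not part of the original report: the function names, the run count, and the timeout are assumptions, and the command is a placeholder to be replaced with the actual MPI test invocation.

```python
import subprocess
import sys
from collections import Counter


def classify_run(cmd, timeout_s):
    """Run cmd once and classify the outcome as 'ok', 'hang', 'crash', or 'fail'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The direct child is killed on timeout; any processes it spawned
        # may need separate cleanup.
        return "hang"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # Negative return code: the direct child was killed by a signal.
        return "crash"
    # Nonzero exit. A launcher may report a segfaulting rank as a positive
    # exit code, in which case the segfault is counted here.
    return "fail"


def repeat(cmd, runs, timeout_s):
    """Run cmd `runs` times and tally the outcomes."""
    return Counter(classify_run(cmd, timeout_s) for _ in range(runs))


if __name__ == "__main__":
    # Placeholder command (hypothetical): substitute the real test launch line.
    cmd = [sys.executable, "-c", "pass"]
    print(dict(repeat(cmd, runs=5, timeout_s=60)))
```

Running the same harness against the neoverse_v1 build and another aarch64 build would give comparable failure counts for the two.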
Related issues:
- OpenFOAM: #24 (closed)
- ESPResSo: https://github.com/EESSI/software-layer/issues/363
- FFTW: https://github.com/EESSI/software-layer/issues/325 and https://github.com/FFTW/fftw3/issues/334
Upstream issue:
Edited by Luke Robison