MPI_ABORT for certain geometries, possibly related to symmetrization

Hi,

I am reporting, on behalf of a user on our cluster, a rare case in which a particular geometry leads to MPI_ABORT without any useful error message. Specifically, the run crashes before "Initial potential from superposition of free atoms" is printed. The input to reproduce the issue (a simple Mg slab) is attached. The pseudopotential does not seem to play any role; for the record, the run used Mg.pbesol-spnl-kjpaw_psl.1.0.0.UPF.

We noticed that (1) this only happens with a sufficient number of MPI ranks (crashes start at exactly 32 MPI tasks in our case), and (2) when it happens, a small tweak to the input file avoids the crash, e.g. changing the cell vector C from 36.0000 to 36.0001.
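For anyone who wants to check the rank threshold on their own machine, here is a minimal sketch of how such a scan could be scripted. The helper names (`pw_command`, `run_ok`, `first_failing`) are made up for illustration, the paths are the ones from the command quoted in this report, and a zero exit status is assumed to mean the run got through. Note that a successful run goes through the whole calculation, so each passing rank count may take a while.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def pw_command(nranks: int,
               pw_x: str = "../q-e/build/bin/pw.x",
               inp: str = "qe-debug.in") -> List[str]:
    """Build the mpirun command line used in this report for a given rank count."""
    return ["mpirun", "-np", str(nranks), pw_x, "-inp", inp]


def run_ok(nranks: int) -> bool:
    """Run pw.x once; assumption: exit status 0 means no MPI_ABORT occurred."""
    result = subprocess.run(pw_command(nranks), capture_output=True, text=True)
    return result.returncode == 0


def first_failing(counts: Iterable[int],
                  ok: Callable[[int], bool] = run_ok) -> Optional[int]:
    """Return the smallest rank count for which `ok` reports a failure, else None."""
    for n in sorted(counts):
        if not ok(n):
            return n
    return None
```

Calling `first_failing([8, 16, 24, 32, 48])` would then report the first rank count in that list that crashes. The `ok` argument can be swapped for any other check, for example one that inspects the log instead of the exit status.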

I bisected to track down the issue, and the problematic commit seems to be de6f4eff, so it works fine in QE 7.1 and has been broken since 7.2. Combined with the observations above, it looks very much as though something goes wrong in the symmetrization code, though I don't have the expertise to suggest a fix.
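In case someone wants to repeat the bisect, the good/bad decision can be automated along the following lines. This is only a sketch: the function names are hypothetical, it assumes pw.x has already been rebuilt for the commit under test, and it treats any log that reaches the "Initial potential" line as good, even if the run fails later for an unrelated reason. The exit codes follow the `git bisect run` convention (0 for good, 1 for bad, 125 for "cannot test, skip").

```python
import subprocess

# Markers taken from the normal and crashed logs in this report.
ABORT_MARKER = "MPI_ABORT was invoked"
PROGRESS_MARKER = "Initial potential from superposition of free atoms"

# Exit codes understood by `git bisect run`.
EXIT_CODES = {"good": 0, "bad": 1, "skip": 125}


def classify(log: str) -> str:
    """Classify a combined stdout+stderr log from pw.x."""
    if PROGRESS_MARKER in log:
        return "good"   # got past the point where the crash occurs
    if ABORT_MARKER in log:
        return "bad"    # aborted before the initial potential was set up
    return "skip"       # neither marker found: build or launch problem


def main() -> int:
    """Run the reproducer with 32 ranks and map the result to an exit code."""
    cmd = ["mpirun", "-np", "32", "../q-e/build/bin/pw.x", "-inp", "qe-debug.in"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return EXIT_CODES["skip"]  # mpirun not available on this machine
    return EXIT_CODES[classify(proc.stdout + proc.stderr)]
```

To use it as a bisect script, one would add `raise SystemExit(main())` at the bottom of the file and pass the file to `git bisect run` after marking one known good and one known bad commit.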

Normal vs crashed log (stdout+stderr) with mpirun -np 32 ../q-e/build/bin/pw.x -inp qe-debug.in:

...
      Estimated max dynamical RAM per process >       1.44 GB

      Estimated total dynamical RAM >      46.23 GB
---------------------------------------------------------------------------
-MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
-  Proc: [[25734,1],11]
-  Errorcode: 1
-
-NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
-You may or may not see output from other processes, depending on
-exactly when Open MPI kills them.
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-prterun has exited due to process rank 11 with PID 0 on node vera1 calling
-"abort". This may have caused other processes in the application to be
-terminated by signals sent by prterun (as reported here).
---------------------------------------------------------------------------
+
+     Check: negative core charge=   -0.000006
+
+     Initial potential from superposition of free atoms
+
+     starting charge    1249.8865, renormalised to    1250.0000
...

Other details:

  • Hardware: AMD EPYC 9354 32-Core Processor
  • OS: Rocky Linux 9.4 (Blue Onyx)
  • Compilers and libs: GCC/13.3.0+OpenMPI/5.0.3+Flexiblas/3.4.4+Scalapack/2.2.20+FFTW/3.3.10, configured with cmake -DQE_ENABLE_SCALAPACK=ON, cmake log here
Edited by Yunqi Shao