Distributed-memory algorithm for ndiag (3*3 and up) causes MPI_Comm_free Error
I am using QE v7.1 on an HPC system and have encountered a rather curious error.
When I run pw.x with the -ndiag flag (e.g. mpiexec -np 32 ./pw.x -ndiag 4 -i pw.inp), I get the error shown below, but only with -ndiag 9 or higher. I ran the following tests, keeping the input file and -np the same:
- -ndiag 1 -> "a serial algorithm will be used" -> no error
- -ndiag 4 -> "custom distributed-memory algorithm (size of sub-group: 2*2 procs)" -> no error
- -ndiag 9 -> "custom distributed-memory algorithm (size of sub-group: 3*3 procs)" -> error (see below)
- -ndiag 16 -> "custom distributed-memory algorithm (size of sub-group: 4*4 procs)" -> error (see below)
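For anyone who wants to reproduce this, the test series can be scripted roughly as follows. This is only a sketch: -np 32 and pw.inp are taken from the example command above, while the build_cmd helper, the DRY_RUN switch and the pw.ndiag*.out file names are placeholders I made up for illustration. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the -ndiag test series (assumed values, adjust to your setup).
NP=32          # number of MPI processes, taken from the example command
INPUT=pw.inp   # input file, kept the same for every run

# Hypothetical helper: print the command line for a given -ndiag value.
build_cmd() {
    echo "mpiexec -np $NP ./pw.x -ndiag $1 -i $INPUT"
}

for nd in 1 4 9 16; do
    cmd=$(build_cmd "$nd")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (default): only show what would be executed.
        echo "$cmd"
    else
        # Real run: keep one output file per -ndiag value.
        $cmd > "pw.ndiag$nd.out" 2>&1
        echo "ndiag=$nd exit=$?"
    fi
done
```

With DRY_RUN=0, a non-zero exit status for a given -ndiag value marks the runs that hit the error.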
The error traceback:
Fatal error in PMPI_Comm_free: Invalid communicator, error stack:
PMPI_Comm_free(145): MPI_Comm_free(comm=0x7fff64d69194) failed
PMPI_Comm_free(93).: Null communicator
It's puzzling that 2*2 works fine while 3*3 and larger sub-groups cause the error.
Any help, or suggestions on what else to test, would be highly appreciated.
Thank you!
-Peter
P.S.: The only thing I could find online that seems to be a similar (or the same) issue was this: link, but no solution was provided there.
Edited by Peter Schindler