Instability of petsc4py init on HPC
I have installed petsc4py multiple times on CentOS; most recently on Nov 8, 2023, on CentOS 8.3.2011 with anaconda3, directly via conda install -c conda-forge petsc=*=*real* petsc4py, and also with petsc=*=*complex*. There seems to be an instability when initializing petsc4py with many processors via mpirun. When I call a Python file containing
import petsc4py
petsc4py.init()
via mpirun (mpich) with 150+ processors spread over multiple nodes, from an sbatch shell script under SLURM, sometimes it runs with no problems, and sometimes it produces no output at all and eventually times out (I once accidentally let a job run for 15 hours with no further output). The chance of it running without problems decreases as the number of processors increases; a sketch of my launch setup is at the end of this post. On a previous install, I have also occasionally seen the following outputs (though usually it just times out):
Abort(680602255) on node 212 (rank 212 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(84).......................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_POSIX_mpi_bcast(224).................:
MPIR_Bcast_impl(444).......................:
MPIR_Bcast_allcomm_auto(370)...............:
MPIR_Bcast_intra_binomial(105).............:
MPIC_Recv(187).............................:
MPIC_Wait(64)..............................:
MPIR_Wait_state(886).......................:
MPID_Progress_wait(335)....................:
MPIDI_progress_test(158)...................:
MPIDI_OFI_handle_cq_error(625).............: OFI poll failed (ofi_events.c:627:MPIDI_OFI_handle_cq_error:Input/output error)
and
[255]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[255]PETSC ERROR: General MPI error
[255]PETSC ERROR: MPI error 1 Invalid buffer pointer
[255]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[255]PETSC ERROR: Petsc Release Version 3.19.5, Aug 30, 2023
[255]PETSC ERROR: Unknown Name on a named n553 by hmmak Fri Oct 27 20:06:36 2023
[255]PETSC ERROR: Configure options AR=${PREFIX}/bin/x86_64-conda-linux-gnu-ar CC=mpicc CXX=mpicxx FC=mpifort CFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /tmp_user/sator/hmmak/conda/envs/fx-real/include " CPPFLAGS="-DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /tmp_user/sator/hmmak/conda/envs/fx-real/include" CXXFLAGS="-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /tmp_user/sator/hmmak/conda/envs/fx-real/include " FFLAGS="-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /tmp_user/sator/hmmak/conda/envs/fx-real/include -Wl,--no-as-needed" LDFLAGS="-pthread -fopenmp -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/tmp_user/sator/hmmak/conda/envs/fx-real/lib -Wl,-rpath-link,/tmp_user/sator/hmmak/conda/envs/fx-real/lib -L/tmp_user/sator/hmmak/conda/envs/fx-real/lib -Wl,-rpath-link,/tmp_user/sator/hmmak/conda/envs/fx-real/lib" LIBS="-lmpifort -lgfortran" --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --FOPTFLAGS=-O3 --with-clib-autodetect=0 --with-cxxlib-autodetect=0 --with-fortranlib-autodetect=0 --with-debugging=0 --with-blas-lib=libblas.so --with-lapack-lib=liblapack.so --with-yaml=1 --with-hdf5=1 --with-fftw=1 --with-hwloc=0 --with-hypre=1 --with-metis=1 --with-mpi=1 --with-mumps=1 --with-parmetis=1 --with-pthread=1 --with-ptscotch=1 --with-shared-libraries --with-ssl=0 --with-scalapack=1 --with-superlu=1 --with-superlu_dist=1 --with-suitesparse=1 --with-x=0 --with-scalar-type=real --prefix=/tmp_user/sator/hmmak/conda/envs/fx-real
[255]PETSC ERROR: #1 PetscWorldIsSingleHost() at /home/conda/feedstock_root/build_artifacts/petsc_1693479875318/work/src/sys/utils/pdisplay.c:92
[255]PETSC ERROR: #2 PetscSetDisplay() at /home/conda/feedstock_root/build_artifacts/petsc_1693479875318/work/src/sys/utils/pdisplay.c:112
[255]PETSC ERROR: #3 PetscOptionsCheckInitial_Private() at /home/conda/feedstock_root/build_artifacts/petsc_1693479875318/work/src/sys/objects/init.c:334
[255]PETSC ERROR: #4 PetscInitialize_Common() at /home/conda/feedstock_root/build_artifacts/petsc_1693479875318/work/src/sys/objects/pinit.c:986
[255]PETSC ERROR: #5 PetscInitialize() at /home/conda/feedstock_root/build_artifacts/petsc_1693479875318/work/src/sys/objects/pinit.c:1277
Traceback (most recent call last):
File "/tmp_user/sator/hmmak/fenicsx-scripts/3D/fenicsx_init.py", line 33, in <module>
petsc4py.init()
File "/tmp_user/sator/hmmak/conda/envs/fx-real/lib/python3.11/site-packages/petsc4py/__init__.py", line 44, in init
PETSc._initialize(args, comm)
File "petsc4py/PETSc/PETSc.pyx", line 568, in petsc4py.PETSc._initialize
File "petsc4py/PETSc/PETSc.pyx", line 461, in petsc4py.PETSc.initialize
petsc4py.PETSc.Error: error code 98
[WARNING] yaksa: 9 leaked handle pool objects
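
For reference, the launch setup boils down to the following sketch. The file name matches the traceback above; the rank count in the comment is illustrative, not my exact value.

# fenicsx_init.py, reduced to the failing part (sketch; the real file
# continues after init). Launched from the sbatch script roughly as:
#   mpirun -n 192 python fenicsx_init.py
# where 192 stands in for the 150+ ranks spread over multiple nodes.
import petsc4py
petsc4py.init()

from petsc4py import PETSc
PETSc.Sys.Print("PETSc initialized on", PETSc.COMM_WORLD.getSize(), "ranks")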
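
Since the first error stack dies inside MPI_Barrier and the OFI provider rather than in PETSc itself, a bare-MPI barrier at the same scale might help isolate whether the problem is below PETSc. A minimal sketch, assuming mpi4py from the same conda environment (so that it links against the same mpich):

# barrier_test.py: if this also hangs or aborts at 150+ ranks,
# the instability is in MPI/libfabric, not in petsc4py.
from mpi4py import MPI

MPI.COMM_WORLD.Barrier()
if MPI.COMM_WORLD.rank == 0:
    print("barrier passed on", MPI.COMM_WORLD.size, "ranks")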