Task assignment issue in PME decomposition
The following discussion from !3313 (merged) should be addressed:
@pszilard started a discussion: (+4 comments)

I'm having some issues rerunning all unit tests (the cluster's Slurm + Open MPI setup is broken, so all binaries launched without mpirun fail), but all MPI tests pass except the following:
```
$ GMX_FORCE_UPDATE_DEFAULT_GPU=1 GMX_FORCE_GPU_AWARE_MPI=1 GMX_ENABLE_DIRECT_GPU_COMM=1 GMX_GPU_PME_DECOMPOSITION=1 ctest -R MdrunMpi2RankPmeTests --output-on-failure
[...]
Opened /mnt/beegfs/home/spall/gromacs/gromacs-main/build_hipsycl-dev_rocm5.3.0_openmpi-4.1.4-rocm-llvm_heffte_RE/src/programs/mdrun/tests/Testing/Temporary/ReproducesEnergies_PmeTest_Runs_basic_mdrun_notunepme_npme_0_pme_auto.edr as single precision energy file
Last energy frame read 20 time 0.020
[ OK ] ReproducesEnergies/PmeTest.Runs/basic_mdrun_notunepme_npme_0_pme_auto (368 ms)
[ RUN ] ReproducesEnergies/PmeTest.Runs/basic_mdrun_notunepme_npme_0_pme_gpu_pmefft_cpu
Reading file /mnt/beegfs/home/spall/gromacs/gromacs-main/build_hipsycl-dev_rocm5.3.0_openmpi-4.1.4-rocm-llvm_heffte_RE/src/programs/mdrun/tests/Testing/Temporary/ReproducesEnergies_PmeTest_basic.tpr, VERSION 2023-rc1-dev-20221212-127720010e (single precision)
This run has forced use of 'GPU-aware MPI'. However, GROMACS cannot determine if underlying MPI is GPU-aware. GROMACS recommends use of latest OpenMPI version for GPU-aware support. If you observe failures at runtime, try unsetting the GMX_FORCE_GPU_AWARE_MPI environment variable.
GMX_ENABLE_DIRECT_GPU_COMM environment variable detected, enabling direct GPU communication using GPU-aware MPI.
This run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable.
This run has requested the 'GPU PME decomposition' feature, enabled by the GMX_GPU_PME_DECOMPOSITION environment variable. PME decomposition lacks substantial testing and should be used with caution.
Can not increase nstlist because an NVE ensemble is used

-------------------------------------------------------
Program:     mdrun-mpi-pme-test, version 2023-rc1-dev-20221212-127720010e
Source file: src/gromacs/taskassignment/taskassignment.cpp (line 129)
Function:    std::vector<GpuTaskAssignment> gmx::(anonymous namespace)::buildTaskAssignment(const gmx::GpuTasksOnRanks &, ArrayRef<const int>)
MPI rank:    0 (out of 2)

Error in user input:
The GPU task assignment requested mdrun to use more than one GPU device on a rank, which is not supported. Request only one GPU device per rank.

For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
[mun-node-14:588977] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[mun-node-14:588977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

0% tests passed, 1 tests failed out of 1

Label Time Summary:
GTest           = 5.19 sec*proc (1 test)
IntegrationTest = 5.19 sec*proc (1 test)
MpiTest         = 5.19 sec*proc (1 test)
SlowGpuTest     = 5.19 sec*proc (1 test)

Total Test time (real) = 5.29 sec

The following tests FAILED:
    65 - MdrunMpi2RankPmeTests (Failed)
```
Full output is here: https://termbin.com/yc4xv
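For context, below is a minimal sketch of the kind of per-rank validation that raises this fatal error. This is not the actual GROMACS implementation; the real check lives in `buildTaskAssignment()` in `src/gromacs/taskassignment/taskassignment.cpp` (line 129 in the traceback above), and the type and function names used here are simplified assumptions for illustration only.

```cpp
// Illustrative sketch only (NOT the actual GROMACS code): the kind of per-rank
// validation that produces the "more than one GPU device on a rank" fatal error.
// The types and names below are simplified assumptions.
#include <set>
#include <stdexcept>
#include <vector>

// One GPU task (e.g. nonbonded or PME) on a rank, already mapped to a device ID.
struct GpuTaskMapping
{
    int deviceId;
};

// Reject any assignment in which a single rank would have to use more than one
// GPU device, mirroring the error text seen in the log above.
void checkAtMostOneDevicePerRank(const std::vector<std::vector<GpuTaskMapping>>& tasksOnRanks)
{
    for (const auto& tasksOnRank : tasksOnRanks)
    {
        std::set<int> devicesUsedOnThisRank;
        for (const auto& task : tasksOnRank)
        {
            devicesUsedOnThisRank.insert(task.deviceId);
        }
        if (devicesUsedOnThisRank.size() > 1)
        {
            throw std::runtime_error(
                    "The GPU task assignment requested mdrun to use more than one GPU "
                    "device on a rank, which is not supported. Request only one GPU "
                    "device per rank.");
        }
    }
}
```

Given that the `pme_auto` case passes and only the `pme_gpu_pmefft_cpu` case fails on 2 ranks with GPU PME decomposition enabled, the task-assignment logic appears to end up mapping two GPU tasks on the same rank to different devices, which a check like the one sketched above then rejects.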