Under an MPI context, `gmxapi.commandline_operation` produces confusing errors for MPI-enabled CLI tools.

`mpiexec python -m mpi4py my_gmxapi_script.py` may abort with confusing errors when `my_gmxapi_script.py` includes `commandline_operation` work using MPI-enabled executables. Presumably this is because the wrapped tool sees the environment variables set by `mpiexec` and tries to call `MPI_Init` itself, even though mpi4py has already done so.
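As background, MPI launchers typically pass job and rank information to child processes through environment variables, which is how a wrapped tool can mistakenly conclude that it should initialize MPI itself. The prefixes in the sketch below (`OMPI_`, `PMIX_`, `PMI_`, `HYDRA_`) are an assumption covering common Open MPI and MPICH/Hydra setups, not an exhaustive list; printing them shows what a subprocess would inherit.

```python
import os

# Assumed (non-exhaustive) prefixes used by common MPI launchers to pass
# rank/job information to child processes: Open MPI uses OMPI_* and PMIX_*,
# while MPICH's Hydra process manager uses PMI_* and HYDRA_*.
MPI_ENV_PREFIXES = ('OMPI_', 'PMIX_', 'PMI_', 'HYDRA_')

# Collect the launcher-related variables that a subprocess would inherit.
launcher_vars = {
    key: value
    for key, value in os.environ.items()
    if key.startswith(MPI_ENV_PREFIXES)
}

for key, value in sorted(launcher_vars.items()):
    print(f'{key}={value}')
```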
Steps to reproduce
- Build and install the `gmxapi` Python package against an MPI-enabled GROMACS installation.
- Run the pytest test suite with MPI:

  ```
  mpiexec -n 1 $(which python) -m mpi4py -m pytest /path/to/src/python_packaging/src/test
  ```
Alternative
- Check out commit c9ea0d1d.
- Configure with `-DGMX_PYTHON_PACKAGE=ON` and MPI enabled.
- `make gmxapi_pytest_mpi` fails because `gmx_mpi solvate...` cannot succeed.
Details
A minimal reproduction of the problem can be performed by executing something like the following script with `mpiexec`:
```python
import subprocess

import mpi4py.MPI

if mpi4py.MPI.COMM_WORLD.Get_rank() == 0:
    subprocess.run("gmx_mpi solvate -box 5 5 5 -p topology.top -o structure.gro".split())
```
or, more generally:
```python
import os
import subprocess
import sys

import mpi4py.MPI

argv = (sys.executable, '-m', 'mpi4py', '-c',
        'import mpi4py.MPI; print(mpi4py.MPI.COMM_WORLD.Get_rank())')

if mpi4py.MPI.COMM_WORLD.Get_rank() == 0:
    subprocess.run(argv)
```
For example, `mpiexec -n 1 $(which python) -m mpi4py test.py` produces output containing something like the following:
```
...
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
```
Removing the environment variables set by the MPI launcher lets the CLI tool behave as though it is running without an MPI context:
```python
import os
import subprocess

import mpi4py.MPI

if mpi4py.MPI.COMM_WORLD.Get_rank() == 0:
    subprocess.run(
        "gmx_mpi solvate -box 5 5 5 -p topology.top -o structure.gro".split(),
        env={'PATH': os.getenv('PATH')})
```
or
```python
import os
import subprocess
import sys

import mpi4py.MPI

argv = (sys.executable, '-m', 'mpi4py', '-c',
        'import mpi4py.MPI; print(mpi4py.MPI.COMM_WORLD.Get_rank())')

if mpi4py.MPI.COMM_WORLD.Get_rank() == 0:
    subprocess.run(
        argv,
        env={'PATH': os.getenv('PATH')},
    )
```
Proposed solution
- Prune the environment variables exposed to the command line executable subprocess, as sketched below. (This is the approach used by other frameworks, such as RADICAL Pilot.)
- Allow users to override the environment map inherited by the subprocess.
- Document the caveats and provide a suggested replacement environment map for users encountering this use case.
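The following is a minimal sketch of the pruning-plus-override behavior, not the actual implementation: the helper name `filtered_environment` and the launcher prefixes are illustrative assumptions and are not part of the `gmxapi` API.

```python
import os

# Assumed prefixes for variables injected by common MPI launchers
# (Open MPI: OMPI_*, PMIX_*; MPICH/Hydra: PMI_*, HYDRA_*).
_LAUNCHER_PREFIXES = ('OMPI_', 'PMIX_', 'PMI_', 'HYDRA_')


def filtered_environment(user_env=None):
    """Build the environment map for a wrapped CLI subprocess.

    If *user_env* is provided, use it verbatim (user override).
    Otherwise, inherit os.environ with MPI-launcher variables pruned.
    """
    if user_env is not None:
        return dict(user_env)
    return {
        key: value
        for key, value in os.environ.items()
        if not key.startswith(_LAUNCHER_PREFIXES)
    }


# Usage sketch: run the wrapped tool with the pruned default environment,
# e.g. subprocess.run(argv, env=filtered_environment()).
```

In this sketch, an explicit user-supplied map takes precedence over the pruned default, which matches the second bullet above.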
Rejected solution
We could change the default environment provided to `gmxapi.commandline_operation` subprocesses without updating the interface, but this could cause a significant and confusing change in behavior between 0.3.1 and all previous versions.
Deferred
We cannot easily and robustly support multi-rank command line tasks in gmxapi workflows under the current scheme, in which we assume MPI is initialized and finalized at the outermost scope of the Python process. Instead, we should refine the execution management middleware layer and defer to the native launch methods of third-party workload management software for optional support of such tasks.

There may be updates we can suggest to developers of command line tools, or better workarounds that we can apply for specific MPI environments or tool designs, but that would be beyond the scope of a bug fix.

For the gmxapi tool that expresses tasks based on command line executables, future interfaces need to accommodate user options regarding the launch method and other environment details or resource requirements, but such work is planned for a new tool to be introduced in a Python package version greater than 0.3.x.
Related issues
We do not have automated test coverage for this right now. I could move ahead on #3563 (closed) with some feedback and clear approval.
It turns out that this bug was originally described at #3086 (comment 310556500). The problem with wrapped command line tools under `mpiexec` is essentially the inverse of the primary issue reported for #3086 (closed) (that MPI had not been initialized, which no longer seems to be a problem).