Subprocesses in MPI contexts need more comprehensive environment handling.
Summary
Documentation and pytest configuration may be insufficient for some software environments.
Example
The problem encountered at https://gromacs.bioexcel.eu/t/gmxapi-runtimeerror-solvate-failed-in-spc-water-box-testing-fixture/6097 was due to heavy-handed environment replacement such as https://gitlab.com/gromacs/gromacs/-/blob/main/python_packaging/gmxapi/test/conftest.py#L167.
These environment replacements were an attempt at a simple fix to #4421 (closed), but the side effects were extreme and egregious. To resolve #4421 (closed) without affecting many unrelated use cases, we need a more delicate resolution that filters only MPI-related environment variables so that non-MPI-related environment variables remain intact, as expected.
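A more delicate resolution could look something like the following sketch, which copies the environment and drops only variables matching known MPI launcher prefixes. The prefix list here is illustrative, not exhaustive, and the helper name is hypothetical; the actual set would depend on the MPI implementations and batch systems to be supported.

```python
import os

# Prefixes commonly set by MPI launchers and job schedulers.
# Illustrative only: the exact set depends on the MPI implementation
# (Open MPI, MPICH, Intel MPI, ...) and the batch system.
MPI_ENV_PREFIXES = ("OMPI_", "PMI_", "PMIX_", "I_MPI_", "HYDRA_", "MV2_", "SLURM_")


def filtered_environment(env=None):
    """Return a copy of *env* (default: os.environ) without MPI-related variables.

    Unlike wholesale environment replacement, non-MPI variables such as
    PATH and LD_LIBRARY_PATH are preserved.
    """
    source = os.environ if env is None else env
    return {key: value for key, value in source.items()
            if not key.startswith(MPI_ENV_PREFIXES)}
```

The point of the filtering approach is that everything not explicitly recognized as MPI-related passes through untouched, so unrelated use cases keep working.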
Exact steps to reproduce
If the `gmx*` command requires `LD_LIBRARY_PATH` or environment variables other than `PATH` in order to function correctly,
- the `solvate` and `grompp` commands in the pytest fixtures will fail, and
- the documentation about using the `env` keyword argument is insufficient.
This is not always reproducible. It is most likely to appear with a shared object build in an environment where transitive dependencies are dynamically linked with reliance on transient runtime environment details. E.g. `module load hwloc` introduces transitive dependencies that do not get included in the `libgromacs` rpath.
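As a workaround in an affected environment, a user can hand an explicit environment to the wrapped command. The runnable part below just assembles a minimal environment dictionary; the `gmxapi.commandline_operation` call is shown commented out as a hypothetical usage, since it requires an installed GROMACS and gmxapi.

```python
import os

# Minimal environment for the wrapped tool. PATH alone may not be
# enough when libgromacs was linked against libraries located through
# LD_LIBRARY_PATH (e.g. after `module load hwloc`).
tool_env = {
    "PATH": os.environ.get("PATH", "/usr/bin:/bin"),
    "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
}

# Hypothetical usage (requires gmxapi and an installed `gmx`):
# import gmxapi
# solvate = gmxapi.commandline_operation(
#     executable="gmx",
#     arguments=["solvate", "-box", "5", "5", "5"],
#     env=tool_env,
# )
```

The issue described above is that users currently have to discover this for themselves, because the docs do not explain which variables the tool actually needs.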
For developers: Why is this important?
- Configuring with `-DGMX_PYTHON_PACKAGE=ON` could cause `make check` to fail in an affected software environment.
- The combinatorics of factors involved could cause user confusion and will be difficult to diagnose in discussion on the forums.
- The `env` keyword argument to `commandline_operation` will be used more, now that there are more use cases for `gmxapi` + `libgromacs_mpi`.
Background
When running a command line subprocess through `gmxapi.commandline_operation()` from a script that was launched with MPI, we don't know whether the command is MPI-aware.
If it is not MPI-aware, then running it on all ranks would duplicate work and would require separate task working directories.
If it is MPI-aware, we have two options: either manipulate the environment so that the command does not see the MPI context and can then behave well running on a single rank, or run it on all ranks with the same task working directories using some sort of launch method that can handle MPI already being initialized in the parent process (which probably still requires cleaning the environment and making a fresh `mpiexec` (or equivalent) call).
We previously decided on a minimal change in which `gmxapi.commandline_operation()` invokes a single subprocess, and advised users to provide an appropriate environment.
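The single-rank option described above can be sketched as follows. This is not how gmxapi is implemented; it is a minimal illustration, assuming `mpi4py` for rank detection and using `echo` as a stand-in for a non-MPI-aware tool. The launcher-variable prefixes stripped from the environment are likewise only examples.

```python
import os
import subprocess

# Determine our MPI rank if the script was launched under MPI.
# mpi4py is an assumption here; outside an MPI context, act as rank 0.
try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
    rank = 0

if rank == 0:
    # Strip MPI launcher variables so the child process does not try to
    # join the parent MPI context, but keep everything else intact.
    clean_env = {key: value for key, value in os.environ.items()
                 if not key.startswith(("OMPI_", "PMI_", "PMIX_", "HYDRA_"))}
    # Stand-in for the wrapped command line tool.
    result = subprocess.run(["echo", "hello"], env=clean_env,
                            capture_output=True, text=True, check=True)
```

Running the tool only on rank 0 avoids the duplicated work and colliding working directories described above, at the cost of idle ranks while the subprocess runs.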
If this is a bug, (1) what happens, and (2) what did you expect to happen?
The details discovered by CMake should be sufficient for wrapping command line operations for the installed GROMACS,
- for regular usage, and
- for the test fixtures
  - in the build tree, and
  - for regular `pip install` + `pytest` use cases.
Instead, build and installation could appear to succeed, but `make check` or run time usage could encounter strange errors ultimately related to mismatched or missing linking details affected by environment variables.
Relevant input files, logs and/or screenshots
Possible fixes
- Better environment manipulation
  - Provide more elaboration in the docs and find a better subset of environment variables to copy for subprocesses in the test suite.
  - Prune specific environment variables from the environment passed to the subprocess.
- Smarter invocation
  - Expand `gmxapi.commandline_operation` to have a more complete expression of multiprocessing / HPC resource contexts.
  - Suppress or manipulate MPI initialization in command line tools that do not use MPI.
- Expand