gmxapi cannot determine working directory on a second mdrun command
Summary
Using gmxapi to build a workflow that chains multiple grompp -> mdrun -> grompp -> mdrun commands fails on the second mdrun with ApiError: Cannot determine working directory state.
GROMACS version
GROMACS 2023 gmxapi 0.4.0
Output of gmx_mpi -quiet -version:
:-) GROMACS - gmx_mpi, 2023 (-:
Executable: /home/apeng/.local/gromacs_2023/bin/gmx_mpi
Data prefix: /home/apeng/.local/gromacs_2023
Working dir: /home/apeng/research/dmref/experiments/ap23-06/pmmaph/flow-runs/debug-runs
Command line:
gmx_mpi -quiet -version
GROMACS version: 2023
Precision: mixed
Memory model: 64 bit
MPI library: MPI (GPU-aware: CUDA)
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /home/apeng/mambaforge/envs/qdmref/bin/x86_64-conda-linux-gnu-cc GNU 10.4.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -O3 -DNDEBUG
C++ compiler: /home/apeng/mambaforge/envs/qdmref/bin/x86_64-conda-linux-gnu-c++ GNU 10.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2022 NVIDIA Corporation;Built on Wed_Jun__8_16:49:14_PDT_2022;Cuda compilation tools, release 11.7, V11.7.99;Build cuda_11.7.r11.7/compiler.31442593_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.10
CUDA runtime: 11.70
Steps to reproduce
Run either python reproduce_error.py or the notebook reproduce_error.ipynb from the attached gmxapi_error.tar.bz2.
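The actual reproducer is in the tarball; the failing pattern looks roughly like the sketch below. File names and the way the first run's output structure is fed to the second grompp are placeholders, not the exact contents of reproduce_error.py (gmxapi 0.4 API).

```python
# Sketch of the failing grompp -> mdrun -> grompp -> mdrun chain.
# Input file names (min.mdp, md.mdp, conf.gro, topol.top) are hypothetical;
# see reproduce_error.py in gmxapi_error.tar.bz2 for the real script.
try:
    import gmxapi as gmx
except ImportError:
    gmx = None  # lets the sketch load even where gmxapi is not installed


def chained_workflow():
    # First grompp -> mdrun: completes normally.
    grompp1 = gmx.commandline_operation(
        "gmx",
        arguments=["grompp"],
        input_files={"-f": "min.mdp", "-c": "conf.gro", "-p": "topol.top"},
        output_files={"-o": "min.tpr"},
    )
    md1 = gmx.mdrun(gmx.read_tpr(grompp1.output.file["-o"]))
    md1.run()

    # Second grompp -> mdrun, continuing from the first run's output
    # structure (chaining detail simplified here).
    grompp2 = gmx.commandline_operation(
        "gmx",
        arguments=["grompp"],
        input_files={"-f": "md.mdp", "-c": "confout.gro", "-p": "topol.top"},
        output_files={"-o": "md.tpr"},
    )
    md2 = gmx.mdrun(gmx.read_tpr(grompp2.output.file["-o"]))
    md2.run()  # raises ApiError: Cannot determine working directory state


if gmx is not None:
    chained_workflow()
```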
What is the current bug behavior?
The second mdrun fails with an ApiError.
What did you expect the correct behavior to be?
The second mdrun exits normally, just as the first mdrun does.
No log files are produced by the failed gmxapi commandline or mdrun calls; the Python traceback is provided instead:
---------------------------------------------------------------------------
ApiError Traceback (most recent call last)
/home/apeng/research/dmref/experiments/ap23-06/gmxapi_error/reproduce_error.ipynb Cell 6 in 1
----> 1 mdrun_cmd2.run()
File ~/mambaforge/envs/dmref/lib/python3.10/site-packages/gmxapi/simulation/mdrun.py:835, in StandardOperationHandle.run(self)
834 def run(self):
--> 835 self.__resource_manager.update_output()
File ~/mambaforge/envs/dmref/lib/python3.10/site-packages/gmxapi/simulation/mdrun.py:743, in ResourceManager.update_output(self)
719 raise exceptions.ProtocolError(
720 "Bug detected: resource manager tried to execute operation twice."
721 )
722 with self.publishing_resources() as publishing_resources:
723 # TODO: rewrite with the pattern that this block is directing and then resolving an operation in the
724 # operation's library/implementation context.
(...)
741 # ResourceManager and task requirements.
742 # TODO: Dispatch/discover this resource factory from a canonical place.
--> 743 input = LegacyImplementationSubscription(self)
744 # End of action of the InputResourceDirector[Context, MdRunSubscription].
745 ###
746
747 # We are giving the director a resource that contains the subscription
748 # to the dispatched work.
749 for member in range(self.ensemble_width):
File ~/mambaforge/envs/dmref/lib/python3.10/site-packages/gmxapi/simulation/mdrun.py:390, in LegacyImplementationSubscription.__init__(self, resource_manager)
385 if not os.path.exists(file):
386 logger.error(
387 f"Expected file {file} not found. gmxapi.mdrun task "
388 f"{resource_manager.operation_id} is in an unknown state. Aborting."
389 )
--> 390 raise exceptions.ApiError(
391 f"Cannot determine working directory state: {workdir}"
392 )
393 else:
394 # Build the working directory and input files.
395 os.mkdir(workdir)
ApiError: Cannot determine working directory state: /home/apeng/research/dmref/experiments/ap23-06/gmxapi_error/mdrun_6d00f4b07eabc852c23b016d1c2ca339_i0_0
Possible fixes
The error "Cannot determine working directory state" is usually triggered when the same procedure is rerun in the same directory. It is possible that the second mdrun call tries to run in the working directory already used (and populated) by the first mdrun.
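The directory name in the error (mdrun_6d00f4b07eabc852c23b016d1c2ca339_i0_0) suggests gmxapi derives the working directory from a hash of the operation's inputs. The toy model below only illustrates the collision mechanism; gmxapi's real fingerprinting scheme differs, and the hash construction here is an assumption.

```python
import hashlib
import json


def workdir_name(operation: str, inputs: dict,
                 member: int = 0, iteration: int = 0) -> str:
    """Illustrative stand-in for gmxapi's operation fingerprinting:
    a deterministic hash of the operation and its inputs names the
    working directory, e.g. mdrun_<hash>_i0_0."""
    fingerprint = hashlib.md5(
        json.dumps({"op": operation, "inputs": inputs},
                   sort_keys=True).encode()
    ).hexdigest()
    return f"{operation}_{fingerprint}_i{member}_{iteration}"


# If the second mdrun operation fingerprints identically to the first,
# both map to the same directory. The second then finds an already
# populated directory whose expected files do not all exist, and the
# resource manager cannot determine its state.
first = workdir_name("mdrun", {"tpr": "md.tpr"})
second = workdir_name("mdrun", {"tpr": "md.tpr"})
assert first == second  # collision: same inputs, same working directory
```

Giving each mdrun a distinguishable input (or running each step from a distinct directory) avoids the collision in this toy model.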