gmxapi cannot determine working directory on a second mdrun command
Summary
Using gmxapi to build a workflow that chains multiple grompp -> mdrun -> grompp -> mdrun commands fails on the second mdrun with ApiError: Cannot determine working directory state.
GROMACS version
GROMACS 2023 gmxapi 0.4.0
Output of gmx_mpi -quiet -version:
:-) GROMACS - gmx_mpi, 2023 (-:
Executable: /home/apeng/.local/gromacs_2023/bin/gmx_mpi
Data prefix: /home/apeng/.local/gromacs_2023
Working dir: /home/apeng/research/dmref/experiments/ap23-06/pmmaph/flow-runs/debug-runs
Command line:
gmx_mpi -quiet -version
GROMACS version: 2023
Precision: mixed
Memory model: 64 bit
MPI library: MPI (GPU-aware: CUDA)
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /home/apeng/mambaforge/envs/qdmref/bin/x86_64-conda-linux-gnu-cc GNU 10.4.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -O3 -DNDEBUG
C++ compiler: /home/apeng/mambaforge/envs/qdmref/bin/x86_64-conda-linux-gnu-c++ GNU 10.4.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2022 NVIDIA Corporation;Built on Wed_Jun__8_16:49:14_PDT_2022;Cuda compilation tools, release 11.7, V11.7.99;Build cuda_11.7.r11.7/compiler.31442593_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_35,code=sm_35;--generate-code=arch=compute_37,code=sm_37;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.10
CUDA runtime: 11.70
Steps to reproduce
Run either python reproduce_error.py or the notebook reproduce_error.ipynb from the attached gmxapi_error.tar.bz2.
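The actual reproducer is in the tarball; the failing pattern looks roughly like the sketch below. File names and the way the first run's output structure is fed to the second grompp are placeholders, not the exact contents of reproduce_error.py (gmxapi 0.4 API).

```python
# Sketch of the failing grompp -> mdrun -> grompp -> mdrun chain.
# Input file names (min.mdp, md.mdp, conf.gro, topol.top) are hypothetical;
# see reproduce_error.py in gmxapi_error.tar.bz2 for the real script.
try:
    import gmxapi as gmx
except ImportError:
    gmx = None  # lets the sketch load even where gmxapi is not installed


def chained_workflow():
    # First grompp -> mdrun: completes normally.
    grompp1 = gmx.commandline_operation(
        "gmx",
        arguments=["grompp"],
        input_files={"-f": "min.mdp", "-c": "conf.gro", "-p": "topol.top"},
        output_files={"-o": "min.tpr"},
    )
    md1 = gmx.mdrun(gmx.read_tpr(grompp1.output.file["-o"]))
    md1.run()

    # Second grompp -> mdrun, continuing from the first run's output
    # structure (chaining detail simplified here).
    grompp2 = gmx.commandline_operation(
        "gmx",
        arguments=["grompp"],
        input_files={"-f": "md.mdp", "-c": "confout.gro", "-p": "topol.top"},
        output_files={"-o": "md.tpr"},
    )
    md2 = gmx.mdrun(gmx.read_tpr(grompp2.output.file["-o"]))
    md2.run()  # raises ApiError: Cannot determine working directory state


if gmx is not None:
    chained_workflow()
```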
What is the current bug behavior?
The second mdrun fails with an ApiError.
What did you expect the correct behavior to be?
The second mdrun exits normally, just as the first mdrun does.
No log files are produced by the failed gmxapi commandline or mdrun calls; the Python traceback is provided instead:
---------------------------------------------------------------------------
ApiError Traceback (most recent call last)
/home/apeng/research/dmref/experiments/ap23-06/gmxapi_error/reproduce_error.ipynb Cell 6 in 1
----> 1 mdrun_cmd2.run()
File ~/mambaforge/envs/dmref/lib/python3.10/site-packages/gmxapi/simulation/mdrun.py:835, in StandardOperationHandle.run(self)
834 def run(self):
--> 835 self.__resource_manager.update_output()
File ~/mambaforge/envs/dmref/lib/python3.10/site-packages/gmxapi/simulation/mdrun.py:743, in ResourceManager.update_output(self)
719 raise exceptions.ProtocolError(
720 "Bug detected: resource manager tried to execute operation twice."
721 )
722 with self.publishing_resources() as publishing_resources:
723 # TODO: rewrite with the pattern that this block is directing and then resolving an operation in the
724 # operation's library/implementation context.
(...)
741 # ResourceManager and task requirements.
742 # TODO: Dispatch/discover this resource factory from a canonical place.
--> 743 input = LegacyImplementationSubscription(self)
744 # End of action of the InputResourceDirector[Context, MdRunSubscription].
745 ###
746
747 # We are giving the director a resource that contains the subscription
748 # to the dispatched work.
749 for member in range(self.ensemble_width):
File ~/mambaforge/envs/dmref/lib/python3.10/site-packages/gmxapi/simulation/mdrun.py:390, in LegacyImplementationSubscription.__init__(self, resource_manager)
385 if not os.path.exists(file):
386 logger.error(
387 f"Expected file {file} not found. gmxapi.mdrun task "
388 f"{resource_manager.operation_id} is in an unknown state. Aborting."
389 )
--> 390 raise exceptions.ApiError(
391 f"Cannot determine working directory state: {workdir}"
392 )
393 else:
394 # Build the working directory and input files.
395 os.mkdir(workdir)
ApiError: Cannot determine working directory state: /home/apeng/research/dmref/experiments/ap23-06/gmxapi_error/mdrun_6d00f4b07eabc852c23b016d1c2ca339_i0_0
Possible fixes
The error "Cannot determine working directory state" is usually triggered when the same procedure is rerun in the same directory. It is possible that the second mdrun call tries to run in the working directory already used (and populated) by the first mdrun.
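The directory name in the error (mdrun_6d00f4b07eabc852c23b016d1c2ca339_i0_0) suggests gmxapi derives the working directory from a hash of the operation's inputs. The toy model below only illustrates the collision mechanism; gmxapi's real fingerprinting scheme differs, and the hash construction here is an assumption.

```python
import hashlib
import json


def workdir_name(operation: str, inputs: dict,
                 member: int = 0, iteration: int = 0) -> str:
    """Illustrative stand-in for gmxapi's operation fingerprinting:
    a deterministic hash of the operation and its inputs names the
    working directory, e.g. mdrun_<hash>_i0_0."""
    fingerprint = hashlib.md5(
        json.dumps({"op": operation, "inputs": inputs},
                   sort_keys=True).encode()
    ).hexdigest()
    return f"{operation}_{fingerprint}_i{member}_{iteration}"


# If the second mdrun operation fingerprints identically to the first,
# both map to the same directory. The second then finds an already
# populated directory whose expected files do not all exist, and the
# resource manager cannot determine its state.
first = workdir_name("mdrun", {"tpr": "md.tpr"})
second = workdir_name("mdrun", {"tpr": "md.tpr"})
assert first == second  # collision: same inputs, same working directory
```

Giving each mdrun a distinguishable input (or running each step from a distinct directory) avoids the collision in this toy model.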