Debugging Pipeline Stoppage
This is the handoff ticket aggregating info and work for @mattbrown7's pipeline issue.
Preconfiguration
- source repository: https://gitlab.com/hoffman-lab/core
- branch:
git checkout mbrown/sm-incubator-tweaks
- Josh: your development should be on a separate branch
- environment setup:
cd core/cmd/client-apis/python/agents/sm_incubator
micromamba create -f sm-env.yaml
micromamba activate sm-core
- install the most up-to-date smcore pip package:
micromamba activate sm-core
cd core/cmd/client-apis/python/pkg
pip install .
Data:
I've relocated the test data to: gpu-1:/mnt/data/stoppage-repro
This is a shared drive hosted on BB-1 so if network is an issue, feel free to make a local copy.
The commands below assume you're running against the shared copy (update the paths accordingly if needed).
Path to the original dataset: /home/matt/core/cmd/client-apis/python/agents/sm_incubator/input
Observed
When using all agents in config.yaml, combined with all cases listed in the input CSV (~180), the pipeline hangs at around case/output number 40. All agents eventually stop sending messages and hang indefinitely.
Expected
All results should be produced for all input data.
KG Setup and Data Injection
All commands assume the working dir is: core/cmd/client-apis/python/agents/sm_incubator
This directory in your local repo needs to be initialized (this may be new to you, Josh)
core init
core config set blackboard.addr bb-1.heph.com:8081
starting the controller:
core start mas sm_controller.py config.yaml
running upload:
python dataset_upload_agent/dataset_upload.py /mnt/data/stoppage-repro/lung_right_chest_xr.csv nifti
From this point, processing continues until every case is complete or the hang occurs.
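Since the hang shows up as outputs simply ceasing to appear, a small watchdog can timestamp when progress stops instead of relying on someone noticing. The following is a minimal sketch, not part of smcore: `StallDetector` and `count_outputs` are hypothetical helpers, and the directory the agents write results to is an assumption you would need to fill in.

```python
import time
from pathlib import Path


class StallDetector:
    """Flags a stall when a progress counter has not increased for `timeout_s` seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock              # injectable so the logic can be tested without sleeping
        self.last_count = None          # highest count seen so far
        self.last_change = clock()      # time of the last observed increase

    def update(self, count):
        """Record the latest count; return True once progress has stalled."""
        now = self.clock()
        if self.last_count is None or count > self.last_count:
            self.last_count = count
            self.last_change = now
            return False
        return (now - self.last_change) >= self.timeout_s


def count_outputs(output_dir):
    """Count regular files under output_dir (hypothetical location of pipeline results)."""
    return sum(1 for p in Path(output_dir).rglob("*") if p.is_file())
```

One way to use it would be to poll `detector.update(count_outputs(<your output dir>))` every 30 seconds or so with a timeout of several minutes, and log `detector.last_count` when it first returns True. That gives a concrete case number and wall-clock time to line up against the agent logs.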
Other things to note
When the hang occurs
- Blackboard will still accept messages via
core control post hello
or from other new agents connecting. This suggests (but does not guarantee):
- BB itself is probably not the issue
- Issue could lie in the Python API
- Adding the min-max agent seems to be the trigger point, but as we saw today, that's not a guarantee. I'd recommend adding agents incrementally until you see a failure. Let me know if you're unable to reproduce it, though.
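If the problem is on the Python side, it helps to see where each agent is blocked at the moment of the hang. The sketch below uses the standard library's `faulthandler` module to dump every thread's traceback when the process receives a signal. It assumes the agents are ordinary Python processes on Linux and that you can add a call near the top of an agent's entry point; `enable_stack_dump` is a hypothetical helper name, and whether smcore already installs its own `SIGUSR1` handler is not something I have checked.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(sig=signal.SIGUSR1, file=sys.stderr):
    """On `sig`, write the traceback of every Python thread to `file`.

    Call once at agent startup. Unix only: faulthandler.register is
    not available on Windows. `file` must stay open for the life of
    the process, because faulthandler keeps its file descriptor.
    """
    faulthandler.register(sig, file=file, all_threads=True)
```

Once an agent is hung, running `kill -USR1 <pid>` from another shell should print the stack of each thread without stopping the process. Tracebacks that all sit in the same send or receive call would point toward the client library, while differing locations would suggest looking at the individual agents.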