Debugging Pipeline Stoppage

This is the handoff ticket aggregating info and work for @mattbrown7's pipeline issue.

Preconfiguration

source repository: https://gitlab.com/hoffman-lab/core
branch: git checkout mbrown/sm-incubator-tweaks
- Josh: your development should be on a separate branch
environment setup:

cd core/cmd/client-apis/python/agents/sm_incubator
micromamba create -f sm-env.yaml
micromamba activate sm-core

Install the most up-to-date smcore pip package:

micromamba activate sm-core
cd core/cmd/client-apis/python/pkg
pip install .

Data:

I've relocated the test data to: gpu-1:/mnt/data/stoppage-repro

This is a shared drive hosted on BB-1, so if the network is an issue, feel free to make a local copy.

The commands below assume you're running against the shared copy (update the paths accordingly if you use a local copy).

Path to the original dataset: /home/matt/core/cmd/client-apis/python/agents/sm_incubator/input

Observed

When all agents in config.yaml are enabled and all cases listed in the input CSV (~180) are uploaded, the pipeline hangs around case/output number 40. All agents eventually stop sending messages and hang indefinitely.

Expected

All results should be produced for all input data.

KG Setup and Data Injection

All commands assume the working dir is: core/cmd/client-apis/python/agents/sm_incubator

This directory in your local repo needs to be initialized (this may be new to you, Josh):

core init
core config set blackboard.addr bb-1.heph.com:8081

starting the controller:

core start mas sm_controller.py config.yaml

running upload:

python dataset_upload_agent/dataset_upload.py /mnt/data/stoppage-repro/lung_right_chest_xr.csv nifti

From this point, processing continues until either every case is complete or the hang occurs.
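To check whether the hang depends on the number of cases, one option is to upload a truncated copy of the input CSV. The helper below is a minimal sketch, not part of the repo: the function name is made up, and it assumes the first row of the CSV is a header and that each following row is one case. If the CSV refers to files by relative path, write the truncated copy next to the original so those paths still resolve.

```python
import csv


def write_first_cases(src, dst, n_cases):
    """Copy the header row plus the first n_cases data rows of src into dst.

    Returns the number of data rows actually written.
    """
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)

        # Assumption: the first row is a header, not a case.
        header = next(reader, None)
        if header is None:
            return 0  # empty input file
        writer.writerow(header)

        written = 0
        for row in reader:
            if written >= n_cases:
                break
            writer.writerow(row)
            written += 1
    return written
```

For example, write_first_cases("/mnt/data/stoppage-repro/lung_right_chest_xr.csv", "/mnt/data/stoppage-repro/first_50.csv", 50) would produce a 50-case file (first_50.csv is a placeholder name) that can be passed to dataset_upload.py in place of the full CSV.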

Other things to note

When the hang occurs

Blackboard still accepts messages sent via core control post hello, and new agents can still connect. This suggests (but does not guarantee):
- BB itself is probably not the issue
- Issue could lie in the Python API
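If the Python API is the suspect, it would help to see where each agent process is stuck when the hang occurs. A minimal sketch using the standard-library faulthandler module follows. The function name is made up, and it assumes the agents run on Linux (SIGUSR1 is POSIX-only) and that a few lines can be added near the top of an agent's entry point.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(stream=sys.stderr, watchdog_seconds=None):
    """Make this process dump the tracebacks of all threads on demand.

    stream must be a real file object (it needs a file descriptor) and
    must stay open for as long as the handlers are installed.
    """
    # After this call, `kill -USR1 <pid>` writes every thread's
    # traceback to `stream` without stopping the process.
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)

    # Optional: also dump tracebacks every `watchdog_seconds` seconds,
    # which is useful when nobody is around to send the signal.
    if watchdog_seconds is not None:
        faulthandler.dump_traceback_later(
            watchdog_seconds, repeat=True, file=stream
        )
```

Calling install_hang_diagnostics() once at agent startup and then sending SIGUSR1 to a hung agent should show whether it is blocked in a send, a receive, or its own processing code.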
Adding the min-max agent seems to be the trigger point, but as we saw today, that's not a guarantee. I'd recommend adding agents incrementally until you see a failure. Let me know if you're unable to reproduce it, though.
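The incremental approach can be sketched as a small search loop. This is only an illustration: the agent names are placeholders, and hangs is a hypothetical callback that you would implement yourself (for example: enable only the given agents in config.yaml, restart the controller, run the upload, and return True if the pipeline stalls). Because the hang is not deterministic, the sketch lets each subset be tried more than once.

```python
def first_hanging_prefix(agents, hangs, attempts=1):
    """Enable agents one at a time, in order, until a hang is observed.

    agents:   ordered list of agent names (placeholders here).
    hangs:    hypothetical callback taking a list of agent names and
              returning True if the pipeline hung with only those enabled.
    attempts: how many times to try each subset before moving on.

    Returns the first prefix of `agents` that hung, or None if none did.
    """
    for count in range(1, len(agents) + 1):
        subset = list(agents[:count])
        for _ in range(attempts):
            if hangs(subset):
                return subset
    return None
```

A result of None after several attempts per subset would suggest the hang depends on something other than which agents are enabled, such as timing or the number of cases.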