Cancel Nextflow run (LSF)

A Nextflow run in any active state (queued, running, etc.) needs to be cancelable.

Ensure that

  • Nextflow terminates
  • Cluster jobs (LSF for OTP) are cancelled by Nextflow.

Note

Manager.cancel is probably implemented incorrectly. Currently, it kills the worker using revoke(terminate=True, signal="SIGKILL"). According to the Celery docs this should not be done, because the worker may have started working on another task in the meantime! What needs to be killed is the workflow executed by the worker. Furthermore, this seems to kill the celery_session_worker in the tests, so that subsequent tests remain queued and time out. With this approach the tests must therefore not use a celery_session_worker (or must have several of them).

Probably the correct way is this:

  1. Currently, the revoke can be done in the QUEUED state. The revocation message is not persisted by default, but only stored in memory by the worker. This means that after a restart of Celery, tasks that were already revoked may start anyway! Persistent revokes (the worker's --statedb option) should be turned on.
  2. If the job is already running and the workflow should be killed gracefully, it might (very tentatively) be possible to send another signal to the worker, e.g. "SIGUSR2" ("SIGUSR1" is reserved for timeouts). The worker would receive the signal and trigger worker-internal logic that gracefully terminates the subprocess of the workflow manager. For a subprocess this will probably mean first sending a "SIGINT" (user interrupt, like CTRL-C), so that the workflow manager has time to clean up, stop running cluster jobs, and indicate in the logs that it was interrupted. Furthermore, a check needs to be implemented in the worker that the subprocess to kill is indeed the workflow run that was requested to be terminated! The return value of the command task will then be whatever the subprocess returns (probably != 0, but a check may be necessary).
  3. Manager.update_run_results currently deals only with the RunStatus.COMPLETE situation, not with the RunState.CANCELLED situation. Fix this.
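The worker-internal logic from step 2 could look roughly like the following. This is only a minimal sketch, assuming POSIX signals and that the worker keeps a handle on the workflow subprocess together with the ID of the run it belongs to. The function name terminate_workflow, its parameters, and the grace period are hypothetical and not part of the existing code.

```python
import signal
import subprocess
import sys
from typing import Optional


def terminate_workflow(proc: subprocess.Popen,
                       proc_run_id: str,
                       requested_run_id: str,
                       grace_seconds: float = 30.0) -> Optional[int]:
    """Interrupt the workflow subprocess, escalating to SIGKILL if needed.

    Returns the subprocess return code, or None if the subprocess does not
    belong to the run that was requested to be cancelled.
    """
    # Safety check: never touch a subprocess that belongs to another run.
    if proc_run_id != requested_run_id:
        return None

    if proc.poll() is None:
        # Like CTRL-C: gives the workflow manager a chance to clean up
        # and to cancel its cluster jobs.
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # The workflow manager did not stop in time; kill it hard.
            proc.kill()
            proc.wait()

    return proc.returncode
```

A handler for "SIGUSR2" in the worker could call this function with the currently running subprocess. Whether Nextflow reliably cancels its LSF jobs on "SIGINT" within the grace period is an assumption that still has to be verified, and the return code should be checked rather than assumed to be non-zero.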

Maybe we can also use workers that always only execute a single task?
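One way to approximate single-task workers is via Celery's worker settings. The fragment below is a sketch of a configuration module (the file name celeryconfig.py is an assumption). The three setting names are Celery's own; whether they are sufficient to remove the race described in the note above has not been tested.

```python
# celeryconfig.py (hypothetical configuration module)

# Replace each pool process after it has executed a single task.
worker_max_tasks_per_child = 1

# Reserve only one message per pool process instead of prefetching several.
worker_prefetch_multiplier = 1

# Run a single pool process per worker, so the worker handles one task at a time.
worker_concurrency = 1
```

Note that this still reuses the same worker for successive tasks, so a revoke with terminate=True would remain unsafe. The settings only limit how many tasks a worker holds at once.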

Edited by Philip Reiner Kensche