add UnkillableStepTimeout parameter (slurm.conf)

David Benaben requested to merge unkillablesteptimeout into master

Because of slowness on the storage, some jobs take a long time to terminate.

[...] on node cpu-node-48, the cleanup of one job took 90 seconds. Here is the log for jobid 17164916:

[2021-06-16T20:11:05.619] sched: _slurm_rpc_allocate_resources JobId=17164916 NodeList=(null) usec=339
[2021-06-16T20:11:05.785] sched: Allocate JobId=17164916 NodeList=cpu-node-48 #CPUs=1 Partition=fast
[2021-06-16T20:15:04.074] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=17164916 uid 100413
[2021-06-16T20:15:04.076] job_signal: 9 of running JobId=17164916 successful 0x8004
[2021-06-16T20:15:36.000] _slurm_rpc_complete_job_allocation: JobId=17164916 error Job/step already completing or completed
[2021-06-16T20:16:20.655] Resending TERMINATE_JOB request JobId=17164916 Nodelist=cpu-node-48
[2021-06-16T20:16:34.000] update_node: node cpu-node-48 reason set to: Kill task failed
[2021-06-16T20:16:34.000] update_node: node cpu-node-48 state set to DRAINING
[2021-06-16T20:16:34.794] cleanup_completing: JobId=17164916 completion process took 90 seconds

This MR makes it possible to change the value of UnkillableStepTimeout.

https://slurm.schedmd.com/slurm.conf.html

UnkillableStepTimeout

The length of time, in seconds, that Slurm will wait before deciding that processes in a job step are unkillable (after they have been signaled with SIGKILL) and execute UnkillableStepProgram. The default timeout value is 60 seconds. If exceeded, the compute node will be drained to prevent future jobs from being scheduled on the node.
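To relate the log excerpt above to the 60-second default, here is a minimal Python sketch that computes the elapsed time from the timestamps in the log. The timestamps are copied verbatim from the excerpt. Using the REQUEST_KILL_JOB line as the starting point is an assumption made for illustration, since Slurm's own accounting of the "completion process" may start from a different event. The names `elapsed_seconds`, `KILL_REQUEST`, `NODE_DRAINED` and `CLEANUP_DONE` are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the slurmctld log excerpt (JobId=17164916).
KILL_REQUEST = "2021-06-16T20:15:04.074"  # _slurm_rpc_kill_job: REQUEST_KILL_JOB
NODE_DRAINED = "2021-06-16T20:16:34.000"  # update_node: state set to DRAINING
CLEANUP_DONE = "2021-06-16T20:16:34.794"  # cleanup_completing

# Default UnkillableStepTimeout, per the slurm.conf documentation quoted above.
DEFAULT_TIMEOUT_S = 60


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Assumption: the kill request is taken as the reference point.
time_to_drain = elapsed_seconds(KILL_REQUEST, NODE_DRAINED)
time_to_cleanup = elapsed_seconds(KILL_REQUEST, CLEANUP_DONE)

print(f"kill request -> node drained: {time_to_drain:.3f} s")
print(f"kill request -> cleanup done: {time_to_cleanup:.3f} s")
print(f"exceeds default timeout: {time_to_cleanup > DEFAULT_TIMEOUT_S}")
```

Under that assumption, the job took roughly 90 seconds to clean up after the kill request. This matches the "completion process took 90 seconds" line and is well above the 60-second default, which is consistent with the node having been drained with the reason "Kill task failed".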
