srun only uses a single core by default
Summary
When submitting an ML model training run via the Mantik compute backend API, only one core is used to execute the job by default. This behaviour cannot be overridden in our backend config.
Steps to reproduce
- Submit a run to JUWELS with code that uses parallelization
- Monitor performance with llview
What is the current bug behavior?
Only a single core is used to train a parallelized model.
Details
There has been a change to the behaviour of the srun command:
Information - New Slurm version 22.05!
On 28. Feb Slurm has been upgraded to version 22.05.
Important changes from 21.08 to 22.05:
- srun will no longer read in SLURM_CPUS_PER_TASK and will not inherit option
--cpus-per-task from sbatch! This means you will explicitly have to specify
--cpus-per-task to your srun calls, or set the new SRUN_CPUS_PER_TASK env
var. If you want to keep using --cpus-per-task with sbatch then you will
have to add: "export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}".
- Using the option --cpus-per-task in 22.05 does imply --exact, which
means that each step with --cpus-per-task will now only get the minimum
number of cores. The pinning will change (implication on the performance)
and the tasks will fill the HW threads of same cores. If you don’t use SMT
and want to keep old behavior as before where your threads run only on
real cores then add this to srun: "--threads-per-core=1".
(from JUWELS ssh welcome message)
We create a batch script that then invokes srun singularity run .... At the moment there is no way to pass flags to srun.
What is the expected correct behavior?
- srun flags can be passed
- parallelized models make use of multiple cores on JUWELS
Possible fixes
- Extend backend config to include srun flags
- Add those flags when building the batch script for UNICORE
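For illustration only (the flag values are examples and the config entry is hypothetical, it does not exist yet), the batch script built for UNICORE could then end up looking roughly like this, with the configured flags appended to the srun call:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # placeholder resource request

# Hypothetical: the flags below would be taken from a new srun-flags entry
# in the backend config and inserted by the batch-script builder.
srun --cpus-per-task="${SLURM_CPUS_PER_TASK}" --threads-per-core=1 \
    singularity run train.sif    # placeholder image; actual command as built by the backend
```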