Slow Training
Hi, I am new to the MLIP workflow and am running into slow training: the training time with 1 CPU is roughly the same as with 16 CPUs. My job output file states "MTPR parallel training started," so I believe I installed the parallel version, but the training times do not reflect that. I would appreciate any and all help. Thanks.
My job submission script looks like this:
#!/bin/bash
#SBATCH --job-name=mtp_time_comp_8
#SBATCH --mail-type=ALL
#SBATCH --mail-user=
#SBATCH --output=train.out
#SBATCH --error=train.err
#SBATCH --account=
#SBATCH --qos=
#SBATCH --partition=hpg-default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=8gb
#SBATCH --time=4-0:00:00
cd $SLURM_SUBMIT_DIR
pwd;
date;
ml intel/2020.0.166 openmpi/4.1.5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
~/mlip-2/build/mlp train 16.mtp ./../../trainfile.cfg --trained-pot-name=pot.mtp --valid-cfgs=./../../testfile.cfg --energy-weight=1 --force-weight=0.001 --weighting=structures --max-iter=1000 --init-params=same --skip-preinit
echo 'Done.'
date;
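One thing I was wondering about while writing this: my understanding from the MLIP-2 README is that parallel training is MPI-based, not OpenMP-based, so setting OMP_NUM_THREADS with a single task may leave the run serial. If that is right, the fix would be to request multiple MPI tasks and launch mlp through srun (or mpirun). A sketch of the changed lines, assuming an MPI-enabled build of mlip-2 (the 8-task count is just an example, not a recommendation):

```shell
#SBATCH --nodes=1
#SBATCH --ntasks=8              # one MPI rank per core, instead of --ntasks=1
#SBATCH --cpus-per-task=1       # ranks, not threads, carry the parallelism
#SBATCH --mem=8gb

cd $SLURM_SUBMIT_DIR
ml intel/2020.0.166 openmpi/4.1.5

# srun launches one mlp process per task so all 8 ranks share the training;
# without this, mlp runs as a single process regardless of the CPUs reserved
srun ~/mlip-2/build/mlp train 16.mtp ./../../trainfile.cfg \
    --trained-pot-name=pot.mtp --valid-cfgs=./../../testfile.cfg \
    --energy-weight=1 --force-weight=0.001 --weighting=structures \
    --max-iter=1000 --init-params=same --skip-preinit
```

Is this the right way to run the parallel version, or is something else wrong with my setup?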