Slow Training
Hi, I am new to the MLIP workflow and am running into slow training: the training time with 1 CPU is roughly the same as with 16 CPUs. My job output file states "MTPR parallel training started," so I believe I installed the parallel version, but the training times do not reflect that. I would appreciate any and all help. Thanks.
My job submission script looks like this:
#!/bin/bash
#SBATCH --job-name=mtp_time_comp_8
#SBATCH --mail-type=ALL
#SBATCH --mail-user=
#SBATCH --output=train.out
#SBATCH --error=train.err
#SBATCH --account=
#SBATCH --qos=
#SBATCH --partition=hpg-default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=8gb
#SBATCH --time=4-0:00:00
cd $SLURM_SUBMIT_DIR
pwd;
date;
ml intel/2020.0.166 openmpi/4.1.5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
~/mlip-2/build/mlp train 16.mtp ./../../trainfile.cfg --trained-pot-name=pot.mtp --valid-cfgs=./../../testfile.cfg --energy-weight=1 --force-weight=0.001 --weighting=structures --max-iter=1000 --init-params=same --skip-preinit
echo 'Done.'
date;
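One thing I was wondering about while writing this: my understanding from the MLIP-2 README is that parallel training is MPI-based, not OpenMP-based, so setting OMP_NUM_THREADS with a single task may leave the run serial. If that is right, the fix would be to request multiple MPI tasks and launch mlp through srun (or mpirun). A sketch of the changed lines, assuming an MPI-enabled build of mlip-2 (the 8-task count is just an example, not a recommendation):

```shell
#SBATCH --nodes=1
#SBATCH --ntasks=8              # one MPI rank per core, instead of --ntasks=1
#SBATCH --cpus-per-task=1       # ranks, not threads, carry the parallelism
#SBATCH --mem=8gb

cd $SLURM_SUBMIT_DIR
ml intel/2020.0.166 openmpi/4.1.5

# srun launches one mlp process per task so all 8 ranks share the training;
# without this, mlp runs as a single process regardless of the CPUs reserved
srun ~/mlip-2/build/mlp train 16.mtp ./../../trainfile.cfg \
    --trained-pot-name=pot.mtp --valid-cfgs=./../../testfile.cfg \
    --energy-weight=1 --force-weight=0.001 --weighting=structures \
    --max-iter=1000 --init-params=same --skip-preinit
```

Is this the right way to run the parallel version, or is something else wrong with my setup?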