Slow training
Hi, I have recently started working with the mlip package, and I am seeing slower training times than I believe are expected; any guidance would be much appreciated. I have trained a number of potentials at different mtp_level values, and their training times are listed below. The data set consists of 4000 training structures and 1000 validation structures, with an average of 8 to 9 atoms per structure. The convergence tolerance is 0.001 and max_iter is 1000. For installation I followed the manual and these steps:
git clone https://gitlab.com/ashapeev/mlip-2.git
./configure
make mlp
All trainings were run on 32 cores, with all cores on the same node.
mtp level 10: 5.5 hours
mtp level 12: 7.5 hours
mtp level 14: 16 hours
mtp level 16: 24 hours
mtp level 18: 36 hours
mtp level 20: currently training at 63 hours, predicted > 72 hours
It is my understanding that the parallel version of mlp is installed, unless I am mistaken. What adjustments might I need to make to my installation to train more efficiently? If I am using the wrong installation instructions, I apologize and would be glad to be pointed to the correct procedure.
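For reference, here is roughly how I have been building and launching training. This is a sketch based on my reading of the MLIP-2 manual: the `--no-mpi` configure option, the `mlp train` flags, and the MPI launcher may differ on your system, and the file names (`init.mtp`, `train.cfg`, `pot.mtp`) are just placeholders for my actual files.

```shell
# Build: make sure MPI is enabled (i.e. do NOT pass --no-mpi to configure),
# otherwise only the serial binary is produced.
./configure
make mlp

# Launch training across all 32 cores with an MPI launcher
# (some clusters require mpiexec or srun instead of mpirun):
mpirun -np 32 ./bin/mlp train init.mtp train.cfg \
    --trained-pot-name=pot.mtp --max-iter=1000
```

If the binary was actually built serially, I assume the mpirun line above would just run 32 independent copies rather than one parallel training, which could explain the timings.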
Additionally, for the trainings that have completed, the f-value is only reaching around 35. Is this most likely due to a lack of training structures? Even the most complex MTP I have trained so far, level 18, has an f-value of only 28.
Apologies for the long post, I look forward to any and all suggestions!
Michael