Slow training
Hi, I have recently started working with the mlip package, and I am seeing slower training times than I believe are expected; any guidance would be much appreciated. I have trained a number of potentials at different mtp_level values, and their training times are listed below. The data set consists of 4000 training structures and 1000 validation structures, with an average of 8 to 9 atoms per structure. The convergence tolerance is 0.001 and max_iter is 1000. For installation I followed the manual and these steps:
git clone https://gitlab.com/ashapeev/mlip-2.git
./configure
make mlp
All trainings were run on 32 cores, with all cores on the same node.
mtp level 10: 5.5 hours
mtp level 12: 7.5 hours
mtp level 14: 16 hours
mtp level 16: 24 hours
mtp level 18: 36 hours
mtp level 20: currently training at 63 hours, predicted > 72 hours
It is my understanding that the parallel version of mlp is installed, unless I am mistaken. What adjustments might I need to make to my installation to train more efficiently? If I am using the wrong installation instructions, I apologize and would be glad to be pointed to the correct procedure.
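For reference, here is roughly how I have been building and launching training. This is a sketch based on my reading of the MLIP-2 manual: the `--no-mpi` configure option, the `mlp train` flags, and the MPI launcher may differ on your system, and the file names (`init.mtp`, `train.cfg`, `pot.mtp`) are just placeholders for my actual files.

```shell
# Build: make sure MPI is enabled (i.e. do NOT pass --no-mpi to configure),
# otherwise only the serial binary is produced.
./configure
make mlp

# Launch training across all 32 cores with an MPI launcher
# (some clusters require mpiexec or srun instead of mpirun):
mpirun -np 32 ./bin/mlp train init.mtp train.cfg \
    --trained-pot-name=pot.mtp --max-iter=1000
```

If the binary was actually built serially, I assume the mpirun line above would just run 32 independent copies rather than one parallel training, which could explain the timings.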
Additionally, for the trainings that have completed, the f-value is only reaching around 35. Is this most likely due to a lack of training structures? Even the most complex MTP I have trained so far, level 18, has an f-value of only 28.
Apologies for the long post, I look forward to any and all suggestions!
Michael