Performance Issues with DFT+U Calculation in Octopus-15.0

We are seeing performance problems with DFT+U calculations in Octopus-15.0. Before upgrading, we were using Octopus-12.0, where we never had any major problems.

Time-evolution PBE+U calculations are slower on Octopus-15.0 than on Octopus-12.0. The time per time-evolution step increases significantly, by up to an order of magnitude for larger clusters (for example Ag309).

For our tests, we compiled both versions of Octopus (12.0 and 15.0) on the Jean Zay supercomputer (http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html) with the same configuration. The two build scripts, "octopus-12.0-zj_MC_intel_oneapi-all_flags_-O3.sh" and "octopus-15.0-zj_MC_intel_oneapi-all_flags_-O3.sh", can be found here: https://amubox.univ-amu.fr/s/DRRNtBr3BE6yANe

Below is the summary of our tests.

Test 1: PBE calculation on Ag55 cluster, on Octopus-12.0 and 15.0 (200 cores).

Ground-state convergence time with Octopus-12.0: 09m 54.709s
Ground-state convergence time with Octopus-15.0: 10m 44.73s
Time-evolution time taken per step with Octopus-12.0: approximately 1.7s
Time-evolution time taken per step with Octopus-15.0: approximately 1.7s
Conclusion: We didn't see any noticeable time difference between the performance of Octopus-12.0 and Octopus-15.0 for the PBE calculation.

Test 2: PBE+U calculation on Ag13 cluster, on Octopus-12.0 and 15.0 (40 cores).

Ground-state convergence time with Octopus-12.0: 9m 51.329s
Ground-state convergence time with Octopus-15.0: 9m 41.34s
Time-evolution time taken per step with Octopus-12.0: approximately 1.37s
Time-evolution time taken per step with Octopus-15.0: approximately 1.95s
Conclusion: For the ground state, there was no major difference in the time for the calculation to converge. However, the time-evolution calculation is about 1.4 times slower on Octopus-15.0.

Test 3: PBE+U calculation on Ag55 cluster, on Octopus-12.0 and 15.0 (200 cores).

Ground-state convergence time with Octopus-12.0: 55m 58.935s
Ground-state convergence time with Octopus-15.0: 01h 01m 50.20s
Time-evolution time taken per step with Octopus-12.0: approximately 2.8s
Time-evolution time taken per step with Octopus-15.0: approximately 7.2s
Conclusion: For the ground state, there was no major difference in the time for the calculation to converge. However, the time-evolution calculation is about 2.6 times slower on Octopus-15.0.

Test 4: PBE+U calculation on Ag309 cluster, on Octopus-12.0 and 15.0 (400 cores).

Ground-state convergence time with Octopus-12.0: 07h 43m 35.607s
Ground-state convergence time with Octopus-15.0: 06h 33m 55.686s
Time-evolution time taken per step with Octopus-12.0: approximately 21.78s
Time-evolution time taken per step with Octopus-15.0: approximately 216.10s
Conclusion: For the ground state, the Octopus-15.0 calculation converged faster by about 1h 10m. However, the time-evolution calculation is about an order of magnitude slower on Octopus-15.0.
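
For reference, the ratios quoted in the conclusions can be reproduced from the timings above with a short script. This is only a sketch: the dictionary layout and the helper names (`to_seconds`, `ratios`) are our own choices for illustration, and the time-evolution values are the approximate per-step times listed in each test.

```python
# Timings copied from Tests 1-4 above.
# "gs": ground-state wall time as (hours, minutes, seconds) for (12.0, 15.0).
# "td": approximate time-evolution time per step in seconds for (12.0, 15.0).
tests = {
    "Test 1": {"gs": ((0, 9, 54.709), (0, 10, 44.73)), "td": (1.7, 1.7)},
    "Test 2": {"gs": ((0, 9, 51.329), (0, 9, 41.34)), "td": (1.37, 1.95)},
    "Test 3": {"gs": ((0, 55, 58.935), (1, 1, 50.20)), "td": (2.8, 7.2)},
    "Test 4": {"gs": ((7, 43, 35.607), (6, 33, 55.686)), "td": (21.78, 216.10)},
}


def to_seconds(hms):
    """Convert an (hours, minutes, seconds) tuple to seconds."""
    hours, minutes, seconds = hms
    return 3600 * hours + 60 * minutes + seconds


def ratios(entry):
    """Return (ground-state ratio, time-evolution ratio) of 15.0 over 12.0."""
    gs_12, gs_15 = (to_seconds(t) for t in entry["gs"])
    td_12, td_15 = entry["td"]
    return gs_15 / gs_12, td_15 / td_12


if __name__ == "__main__":
    for name, entry in tests.items():
        gs_ratio, td_ratio = ratios(entry)
        print(f"{name}: ground state x{gs_ratio:.2f}, time evolution x{td_ratio:.2f}")
```

A ratio above 1 means Octopus-15.0 is slower. The ground-state ratios stay close to 1 in all four tests, while the time-evolution ratio for the PBE+U runs grows with cluster size.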

All the tests can be found here: https://amubox.univ-amu.fr/s/DRRNtBr3BE6yANe

At this stage, we are not sure what is causing the slowdown: it could be related to the code itself or to the way we compiled it on this machine. We see a comparable slowdown on our local cluster between Octopus-9.1 and 15.0, but there we could not compile the two versions identically.

Any help would be greatly appreciated. We are happy to provide further information if needed.