Batchify get_coefficient for DFT+U on CPU

Description

This merge request batchifies several parts of the code related to DFT+U. In particular, it batchifies:

  • the calculation of the PDOS
  • the calculation of the projected band structure
  • the calculation of the projection on the atomic orbitals on CPU

It also rewrites the CPU version of the DFT+U routine for getting coefficients, removing the explicit call to batch_get_states. This should not be seen as a fully optimized version: further improvements are certainly possible, especially for the packed case.
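To illustrate the idea of projecting a whole batch without first copying each state out of it, here is a minimal sketch in pure Python. It is not the Octopus implementation: the function names, the `batch[ip][ist]` layout (states stored side by side for each grid point) and the volume element `dv` are all assumptions made for the example.

```python
# Illustrative sketch only, not Octopus code. All names are hypothetical.
# batch[ip][ist]   : value of state ist at grid point ip
# orbitals[iorb][ip]: value of localized orbital iorb at grid point ip


def coeffs_per_state(batch, orbitals, dv):
    """Copy each state out of the batch, then project it on every orbital."""
    nst = len(batch[0])
    result = []
    for ist in range(nst):
        psi = [row[ist] for row in batch]  # explicit copy of one state
        result.append([
            dv * sum(o.conjugate() * p for o, p in zip(orb, psi))
            for orb in orbitals
        ])
    return result


def coeffs_batched(batch, orbitals, dv):
    """Project all states of the batch while walking the grid, without copies."""
    nst = len(batch[0])
    result = [[0j] * len(orbitals) for _ in range(nst)]
    for iorb, orb in enumerate(orbitals):
        for ip, row in enumerate(batch):
            weight = dv * orb[ip].conjugate()
            for ist in range(nst):  # innermost loop runs over the states
                result[ist][iorb] += weight * row[ist]
    return result


# Tiny example: 3 grid points, 2 states, 2 orbitals.
example_batch = [[1 + 0j, 0 + 1j], [2 + 0j, 1 + 1j], [0 + 0j, 3 + 0j]]
example_orbitals = [[1 + 0j, 0j, 0j], [0j, 1j, 1 + 0j]]
reference = coeffs_per_state(example_batch, example_orbitals, 0.5)
batched = coeffs_batched(example_batch, example_orbitals, 0.5)
```

Both functions return the same coefficients up to rounding. The second one reads the batch storage directly, which is the kind of access pattern that avoids a per-state copy.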

This fixes the performance regression relative to Octopus 12 reported in #1206 (closed).

For the Ag13 cluster considered in the issue, the timings are:

Version      Averaged time per iter [s]    ZORBSET_GET_COEFF [s]
12.0         1.50                          23
main         2.187                         160
This work    1.276                         15

The Ag13 cluster therefore runs about 15% faster with PBE+U than with version 12.0 (1.276 s versus 1.50 s per iteration).
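As a quick check of the figures quoted above, the following sketch recomputes the relative gains from the per-iteration timings in the table. The helper names are made up for this example. It reports both the fraction of time saved and the speed ratio, since "faster" can mean either.

```python
# Timings copied from the table above (averaged seconds per iteration).
t_v12, t_main, t_new = 1.50, 2.187, 1.276


def time_reduction(t_ref, t):
    """Fraction of the reference time that is saved."""
    return (t_ref - t) / t_ref


def speedup(t_ref, t):
    """Ratio of reference time to new time."""
    return t_ref / t


for label, t_ref in (("12.0", t_v12), ("main", t_main)):
    print(
        f"vs {label}: {100 * time_reduction(t_ref, t_new):.1f}% less time, "
        f"{speedup(t_ref, t_new):.2f}x speed ratio"
    )
```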

Closes #1206 (closed)

News snippet

Fix a performance regression on CPU for DFT+U calculations and further optimize some related parts.

Checklist

  • I have checked that my code follows the Octopus coding standards
  • I have added tests for all the new features added in this request.
