Batchify get_coefficient for DFT+U on CPU
Description
This merge request batchifies several parts of the code related to DFT+U. In particular, it batchifies:
- the calculation of the PDOS
- the calculation of the projected band structure
- the calculation of the projection on the atomic orbitals on CPU
It also rewrites the CPU version of the DFT+U routine for getting coefficients, to remove the explicit call to batch_get_states. This should not be seen as a fully optimized version, as one can certainly do better, especially for the packed case.
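To illustrate the general idea of dropping a per-state copy, here is a minimal Python sketch. It is not the Octopus implementation: the function names, the `batch[ip][ist]` layout, and the sample data are all assumptions made for this example. It compares projecting each state after copying it out of the batch with accumulating all projections while walking the batch once.

```python
# Hypothetical illustration only: names and data layout are assumptions,
# not the actual Octopus routines.

def coefficients_with_copy(batch, orbital):
    """Copy each state out of the batch, then project it on the orbital."""
    nstates = len(batch[0])
    coeffs = []
    for ist in range(nstates):
        # Explicit per-state copy (stand-in for a batch_get_states-like call).
        psi = [row[ist] for row in batch]
        coeffs.append(sum(o.conjugate() * p for o, p in zip(orbital, psi)))
    return coeffs


def coefficients_in_place(batch, orbital):
    """Accumulate all projections while walking the batch once, without copies."""
    nstates = len(batch[0])
    coeffs = [0j] * nstates
    for o, row in zip(orbital, batch):
        oc = o.conjugate()
        for ist in range(nstates):
            coeffs[ist] += oc * row[ist]
    return coeffs


# batch[ip][ist]: value of state ist at grid point ip (state index runs fastest).
batch = [[1 + 1j, 2 + 0j], [0 + 2j, 1 - 1j], [3 + 0j, 0 + 1j]]
orbital = [1 + 0j, 0 + 1j, 2 + 0j]

copied = coefficients_with_copy(batch, orbital)
in_place = coefficients_in_place(batch, orbital)
```

Both functions return the same coefficients; the second one reads each batch entry once and needs no temporary array per state.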
This fixes the performance regression compared to Octopus 12, as reported in #1206 (closed).
For the Ag13 cluster considered in the issue, the timings are:
| Version | Averaged time per iter [s] | ZORBSET_GET_COEFF [s] |
|---|---|---|
| 12.0 | 1.50 | 23 |
| main | 2.187 | 160 |
| This work | 1.276 | 15 |
The Ag13 cluster therefore runs about 15% faster with PBE+U than with version 12.0.
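The percentage can be checked directly from the table. The short Python snippet below only redoes the arithmetic on the reported numbers; the variable and function names are chosen for this example.

```python
# Timings copied from the table above (seconds).
avg_time = {"12.0": 1.50, "main": 2.187, "this work": 1.276}
get_coeff = {"12.0": 23, "main": 160, "this work": 15}


def reduction(old, new):
    """Relative reduction in time, in percent."""
    return 100.0 * (old - new) / old


vs_12 = reduction(avg_time["12.0"], avg_time["this work"])
vs_main = reduction(avg_time["main"], avg_time["this work"])
coeff_ratio = get_coeff["main"] / get_coeff["this work"]

print(f"Time per iteration vs 12.0: {vs_12:.1f}% lower")
print(f"Time per iteration vs main: {vs_main:.1f}% lower")
print(f"ZORBSET_GET_COEFF vs main: {coeff_ratio:.1f}x less time")
```

The time per iteration is thus roughly 15% lower than in 12.0 and roughly 42% lower than on main, while the time spent in ZORBSET_GET_COEFF drops by about a factor of ten relative to main.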
Closes #1206 (closed)
News snippet
Fix a performance regression on CPU for DFT+U calculations and further optimize some related parts.
Checklist
- I have checked that my code follows the Octopus coding standards.
- I have added tests for all the new features added in this request.