Allocate packed host memory by default for wavefunctions; overlap computation and communication for TD
Batches that are initialized without external memory are now allocated directly in a packed state; for the moment, this applies only to the state wavefunctions. The special flag is not passed when using copy_to, to avoid allocating pinned memory for GPU runs, which is very costly.
Packing/unpacking can now also be asynchronous if it consists only of copying from packed memory on the host to packed memory on the device and back. This is used in the TD propagation to overlap computation and data transfer: the next batch is prefetched asynchronously while the current batch is being processed. In some tests I ran, the overlapping improved the runtime of TD calculations by a factor of 1.8.
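For reviewers unfamiliar with the pattern, the prefetching scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the code in this request: `transfer_to_device`, `propagate`, and `propagate_all` are hypothetical names, and a single worker thread stands in for an asynchronous host-to-device copy.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer_to_device(batch):
    # Hypothetical stand-in for an asynchronous copy of a packed host
    # batch into packed device memory.
    return list(batch)


def propagate(device_batch):
    # Hypothetical stand-in for the TD propagation step on one batch.
    return [2 * x for x in device_batch]


def propagate_all(batches):
    results = []
    if not batches:
        return results
    # A single worker plays the role of a dedicated transfer queue.
    with ThreadPoolExecutor(max_workers=1) as copy_queue:
        # Start the transfer of the first batch.
        pending = copy_queue.submit(transfer_to_device, batches[0])
        for i in range(len(batches)):
            # Wait until the current batch has arrived.
            current = pending.result()
            if i + 1 < len(batches):
                # Prefetch the next batch before computing on this one,
                # so that transfer and computation can overlap.
                pending = copy_queue.submit(transfer_to_device, batches[i + 1])
            results.append(propagate(current))
    return results
```

The point of the sketch is the ordering inside the loop: the copy of batch i+1 is submitted before the work on batch i starts, so the transfer time can be hidden behind the computation.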
Closes #228.
Initialize batches in packed state for the wave functions. Introduce overlapping computation and data transfer for GPU runs.
- I have checked that my code follows the Octopus coding standards
- I have added tests for all the new features added in this request.