Overlap GPU communications and computation
When using `StatesPack = no`, each batch is copied to and from the GPU when batches are swapped.
In principle, these copies can be overlapped with each other and with the computations.
This is now implemented in !771 (merged). After the reorganization of the batches, the data can be copied from X(ff_pack) directly to the GPU without needing to transpose it, so the copies can be made asynchronous and overlapped with the kernels.

This is done in the propagator, where the next batch is always prefetched (i.e. copied asynchronously) while the current batch is being processed. With this, copies to and from the GPU can overlap with each other and with kernel execution.
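The prefetching scheme described above can be sketched as a double-buffering loop. This is a minimal, hypothetical illustration in plain C++, where `std::async` stands in for an asynchronous host-to-device copy and `process` stands in for the GPU kernels; none of the names correspond to actual Octopus routines:

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for an asynchronous host-to-device copy: returns a future that
// yields the "device-side" copy of the batch.
std::future<std::vector<double>> prefetch(const std::vector<double>& batch) {
    return std::async(std::launch::async, [batch] { return batch; });
}

// Stand-in for the GPU kernels applied to the current batch (here: a sum).
double process(const std::vector<double>& device_batch) {
    return std::accumulate(device_batch.begin(), device_batch.end(), 0.0);
}

double run_propagator(const std::vector<std::vector<double>>& batches) {
    double total = 0.0;
    // Start the copy of the first batch before entering the loop.
    auto pending = prefetch(batches.at(0));
    for (std::size_t i = 0; i < batches.size(); ++i) {
        // Wait until the current batch has arrived "on the device".
        std::vector<double> current = pending.get();
        // Immediately start the copy of the next batch, so that it
        // overlaps with the processing of the current one.
        if (i + 1 < batches.size())
            pending = prefetch(batches[i + 1]);
        total += process(current);
    }
    return total;
}
```

In real GPU code the same structure would use asynchronous copies on a separate stream plus an event or stream synchronization before processing; the loop shape, with the next transfer issued before the current batch is processed, is what enables the overlap.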
Old description:
In order to avoid race conditions on the GPU memory, operations need to be combined into streams; operations could then, in principle, be launched asynchronously. In practice, this is not yet possible, because some variables live only in temporary arrays at the time the memcopy or the kernel is launched.
This could be solved by having several threads on the CPU, each of which owns one stream on the GPU. All operations within a thread could then be synchronous with respect to the GPU. It requires, however, that all routines be thread-safe, which currently they are not.
Alternatively, all streams could be handled by the same thread, but then the code needs to be changed to ensure that transfers to the GPU are not performed from temporary variables.
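The lifetime problem with temporary variables can be illustrated with a small, hypothetical C++ sketch. Here `std::async` again stands in for an asynchronous GPU transfer, and `Batch`/`copy_to_device` are illustrative names, not Octopus code. The key point is that the source buffer of an asynchronous transfer must outlive the transfer, so it should be owned by a persistent object rather than being a function-local temporary:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical batch type that owns a persistent staging buffer, so an
// asynchronous transfer started from it remains valid until completion.
struct Batch {
    std::vector<double> staging;  // lives as long as the batch itself

    // Safe: the asynchronous "transfer" reads from the member buffer, whose
    // lifetime is tied to the Batch object, not to the calling scope.
    std::future<std::vector<double>> copy_to_device() const {
        return std::async(std::launch::async, [this] { return staging; });
    }
};

// The unsafe variant the text warns against would look like this:
//
//   std::future<...> bad_copy() {
//       std::vector<double> tmp = assemble_data();   // temporary
//       return start_async_transfer(tmp.data());     // tmp is destroyed
//   }                                                // before the copy ends
```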
What would be the best way?
Related to #164 (closed).