Use several streams for DOTPV_BATCH in the CUDA version. With this approach, the dot products with offsets can be overlapped effectively. The same approach is also implemented for mesh_batch_nrm2.
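
For reviewers unfamiliar with the pattern, here is a minimal standalone sketch of overlapping dot products at different offsets by issuing them on several streams. It is not the Octopus implementation: it assumes cuBLAS `cublasDdot` with device pointer mode, and the function name `dot_batch_streams`, the contiguous batch layout and the stream count are hypothetical choices made for illustration. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: computes nbatch dot products, one per vector in a
// contiguous batch, distributing the calls round-robin over nstreams streams.
std::vector<double> dot_batch_streams(int n, int nbatch, int nstreams) {
    const size_t total = static_cast<size_t>(n) * nbatch;
    // Assumed layout: vector i of the batch starts at offset i * n.
    std::vector<double> ha(total, 1.0), hb(total, 2.0), hres(nbatch, 0.0);

    double *da = nullptr, *db = nullptr, *dres = nullptr;
    cudaMalloc(&da, total * sizeof(double));
    cudaMalloc(&db, total * sizeof(double));
    cudaMalloc(&dres, nbatch * sizeof(double));
    cudaMemcpy(da, ha.data(), total * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), total * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Results are written to device memory, so cublasDdot does not block
    // the host and calls on different streams are free to overlap.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    std::vector<cudaStream_t> streams(nstreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    for (int i = 0; i < nbatch; ++i) {
        // Select the stream for this dot product, then launch at offset i * n.
        cublasSetStream(handle, streams[i % nstreams]);
        cublasDdot(handle, n, da + (size_t)i * n, 1, db + (size_t)i * n, 1,
                   dres + i);
    }

    // Wait for all streams before reading the results back.
    cudaDeviceSynchronize();
    cudaMemcpy(hres.data(), dres, nbatch * sizeof(double),
               cudaMemcpyDeviceToHost);

    for (auto &s : streams) cudaStreamDestroy(s);
    cublasDestroy(handle);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dres);
    return hres;
}

int main() {
    const std::vector<double> res = dot_batch_streams(1 << 16, 4, 2);
    for (size_t i = 0; i < res.size(); ++i)
        std::printf("dot[%zu] = %.1f\n", i, res[i]);
    return 0;
}
```

The key design choice in this sketch is the device pointer mode: with the default host pointer mode, each `cublasDdot` call waits for its result, which would serialize the dot products regardless of the stream they were issued on.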
Depends on !675 (merged).
Use streams for DOTPV_BATCH
- I have checked that my code follows the Octopus coding standards.
- I have added tests for all the new features added in this request.