WIP: Resolve "Improve GPU performance of DOTPV_BATCH"
The GPU version of DOTPV_BATCH has been rewritten in analogy to the MESH_BATCH_NRM2 routine. This avoids repeated calls to the accel_dot kernel and computes all dot products within a batch simultaneously, using a cuBLAS GEMV call and a new kernel that performs the pointwise multiplication of two batches.
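
For reviewers, the idea can be illustrated with a small CPU-side sketch in plain Python. It is not the Octopus implementation: the function names, the `[point][state]` data layout, and the use of a per-point weight vector as the GEMV operand are assumptions made for illustration only. It shows that one pointwise multiplication over the whole batch, followed by a single matrix-vector product, gives the same result as one dot product per state.

```python
# Illustrative CPU sketch only; names and data layout are hypothetical
# and do not correspond to actual Octopus routines.

def pointwise_multiply(aa, bb):
    """Stand-in for the new kernel: cc[ip][ist] = conj(aa[ip][ist]) * bb[ip][ist]."""
    return [[complex(a).conjugate() * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(aa, bb)]


def gemv_transposed(cc, weights):
    """Stand-in for the GEMV call: out[ist] = sum over ip of cc[ip][ist] * weights[ip]."""
    nst = len(cc[0]) if cc else 0
    out = [0j] * nst
    for row, w in zip(cc, weights):
        for ist, value in enumerate(row):
            out[ist] += value * w
    return out


def dotpv_batch_sketch(aa, bb, weights):
    """All dot products of the batch from one multiply and one matrix-vector product."""
    return gemv_transposed(pointwise_multiply(aa, bb), weights)


def dotpv_loop_reference(aa, bb, weights):
    """One dot product per state, analogous to repeated single-vector calls."""
    npoints = len(aa)
    nst = len(aa[0]) if aa else 0
    return [sum(complex(aa[ip][ist]).conjugate() * bb[ip][ist] * weights[ip]
                for ip in range(npoints))
            for ist in range(nst)]


if __name__ == "__main__":
    # Two mesh points, two states; weights play the role of volume elements (assumed).
    aa = [[1, 2j], [3, 4]]
    bb = [[1, 1], [2, 1j]]
    weights = [0.5, 0.5]
    print(dotpv_batch_sketch(aa, bb, weights))
    print(dotpv_loop_reference(aa, bb, weights))
```

On the GPU, the two stand-in functions correspond to one launch of the multiplication kernel and one GEMV call per batch, instead of one reduction kernel launch per state.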
This improves the GPU performance of DOTPV_BATCH by rewriting the method to use fewer kernel calls.
- I have checked that my code follows the Octopus coding standards.
- I have added tests for all the new features added in this request.