WIP: Use batched cuBLAS gemm calls for DOTPV_BATCH
Description
Use gemm_strided_batched to reduce the number of kernel launches to one per batch instead of one per state. The OpenCL version keeps the loop over gemm calls with offsets, because there is no batched gemm call for OpenCL.
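For reviewers unfamiliar with strided batched gemm, here is a minimal CPU-only sketch of its semantics. It is not the Octopus code and does not call cuBLAS. The names gemm_reference, gemm_strided_batched_reference and run_example are hypothetical, and column-major storage with a fixed stride between consecutive matrices is an assumption of the example. It shows that matrix b of each operand sits at base + b * stride, which is the same addressing the loop over gemm calls with offsets uses. The GPU routine does this work in a single call rather than one call per matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive column-major GEMM: C = alpha * A * B + beta * C, with A (m x k),
// B (k x n), C (m x n). Stand-in for a single BLAS gemm call.
void gemm_reference(int m, int n, int k, double alpha,
                    const double* A, int lda,
                    const double* B, int ldb,
                    double beta, double* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p) {
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       B[p + static_cast<std::size_t>(j) * ldb];
            }
            double& c = C[i + static_cast<std::size_t>(j) * ldc];
            c = alpha * acc + beta * c;
        }
    }
}

// Reference semantics of a strided batched gemm: batch entry b uses the
// matrices starting at A + b*strideA, B + b*strideB and C + b*strideC.
// On the CPU this is just the loop over gemm calls with offsets; a GPU
// library performs the whole batch in one call.
void gemm_strided_batched_reference(int m, int n, int k, double alpha,
                                    const double* A, int lda, std::size_t strideA,
                                    const double* B, int ldb, std::size_t strideB,
                                    double beta,
                                    double* C, int ldc, std::size_t strideC,
                                    int batch_count) {
    assert(batch_count >= 0);
    for (int b = 0; b < batch_count; ++b) {
        gemm_reference(m, n, k, alpha,
                       A + b * strideA, lda,
                       B + b * strideB, ldb,
                       beta,
                       C + b * strideC, ldc);
    }
}

// Hypothetical example: two 2x2 products stored back to back.
std::vector<double> run_example() {
    const int m = 2, n = 2, k = 2, batch_count = 2;
    // Batch 0: identity; batch 1: 2 * identity (column-major).
    std::vector<double> A = {1, 0, 0, 1,   2, 0, 0, 2};
    // Both batches: [[1, 2], [3, 4]] stored column-major as 1, 3, 2, 4.
    std::vector<double> B = {1, 3, 2, 4,   1, 3, 2, 4};
    std::vector<double> C(static_cast<std::size_t>(m) * n * batch_count, 0.0);
    gemm_strided_batched_reference(m, n, k, 1.0,
                                   A.data(), m, static_cast<std::size_t>(m) * k,
                                   B.data(), k, static_cast<std::size_t>(k) * n,
                                   0.0,
                                   C.data(), m, static_cast<std::size_t>(m) * n,
                                   batch_count);
    return C;
}
```

In this sketch the stride equals the full matrix size, so the batch entries are contiguous. Any larger stride would work as long as the matrices do not overlap.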
News snippet
Use batched cuBLAS gemm calls for DOTPV_BATCH
Checklist
- I have checked that my code follows the Octopus coding standards
Edited by Sebastian Ohlmann