WIP: Use batched cublas gemm calls for DOTPV_BATCH
Use gemm_strided_batched so that each batch needs a single kernel launch instead of one launch per state. The OpenCL version keeps the loop over gemm calls with offsets, because there is no batched gemm call for OpenCL.
- I have checked that my code follows the Octopus coding standards