Improve GPU performance of DOTPV_BATCH
Currently, the dotpv_batch routine X(mesh_batch_dotp_vector) calls X(accel_dot) once for each state, which is highly inefficient. An attempt to use the batched cuBLAS routines did not result in any improvement. However, the situation is similar to the mesh_batch_nrm2 routine, which was improved by reformulating the problem as a point-wise multiplication of the whole batch of states, followed by a cublas_dgemv call to do the summation over the mesh.
Here we will attempt the same for the dotpv_batch routine.
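
To illustrate the idea, here is a minimal NumPy sketch of the reformulation for real-valued states. It is not the Octopus implementation. The function names, the (state, mesh point) array layout, and the uniform volume element `vol` are assumptions made for illustration only. The first function mimics one dot product per state. The second forms the point-wise product of the whole batch and then sums over the mesh with a single matrix-vector product, which is the role cublas_dgemv plays on the GPU.

```python
import numpy as np


def dotpv_per_state(aa, bb, vol):
    """Baseline: one dot product per state (one call per state)."""
    nst = aa.shape[0]
    return np.array([vol * np.dot(aa[ist], bb[ist]) for ist in range(nst)])


def dotpv_batched(aa, bb, vol):
    """Point-wise product of the whole batch, then one matrix-vector product."""
    prod = aa * bb                        # shape (nst, npoints), element-wise
    weights = np.full(aa.shape[1], vol)   # constant vector carrying the volume element
    return prod @ weights                 # sum over the mesh for all states at once


# Hypothetical test data: 4 states on a mesh of 1000 points.
rng = np.random.default_rng(0)
aa = rng.standard_normal((4, 1000))
bb = rng.standard_normal((4, 1000))
vol = 0.1

per_state = dotpv_per_state(aa, bb, vol)
batched = dotpv_batched(aa, bb, vol)
print(np.allclose(per_state, batched))
```

Both functions return one value per state. The batched form replaces many small reductions with one element-wise kernel and one matrix-vector product, at the cost of a temporary array the size of the batch.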