Use batched cuBLAS call in GPU version of DOTPV_BATCH

Right now, the GPU version issues many small calls to X(dot) kernels, one per dot product. These could probably be merged into a single batched dgemm call (each batch entry being a 1xN matrix times an Nx1 matrix) to reduce the kernel launch overhead.
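
To make the proposed mapping concrete, here is a minimal CPU reference sketch in C of what the batched call would compute: one 1x1 result per batch entry, C_i = A_i^T * B_i. The function name `batched_dot_reference` and the storage layout (each vector contiguous, vector i starting at `i * stride`) are assumptions for illustration only; the actual layout used by DOTPV_BATCH may differ.

```c
#include <stddef.h>

/* CPU reference for the shape a batched GEMM would take here:
 *   result[i] (1x1) = A_i^T (1xN) * B_i (Nx1),  i = 0 .. batch-1.
 * Assumed (hypothetical) layout: vector i of `a` starts at a + i*stride_a
 * and its n elements are contiguous; likewise for `b`. */
static void batched_dot_reference(size_t n, size_t batch,
                                  const double *a, size_t stride_a,
                                  const double *b, size_t stride_b,
                                  double *result)
{
    for (size_t i = 0; i < batch; ++i) {
        const double *ai = a + i * stride_a; /* start of vector A_i */
        const double *bi = b + i * stride_b; /* start of vector B_i */
        double sum = 0.0;
        for (size_t k = 0; k < n; ++k) {
            sum += ai[k] * bi[k];            /* inner (k) dimension of the GEMM */
        }
        result[i] = sum;                     /* the single entry of C_i */
    }
}
```

Under the same layout assumption, this would correspond to one `cublasDgemmStridedBatched` call with `transa = CUBLAS_OP_T`, `transb = CUBLAS_OP_N`, `m = 1`, `n = 1`, `k = N`, `lda = ldb = N`, `ldc = 1`, `strideC = 1`, and `batchCount` equal to the number of dot products. If X(dot) also covers complex data, the complex variant would presumably need a conjugate transpose instead.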