zDOTPV_BATCH is inefficient for GPUs

We should rewrite this routine to reduce its cost. At present, it can account for as much as 25% of the total compute time for time propagation with the Lanczos exponential method.
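
To make the direction of a rewrite concrete, here is a minimal NumPy sketch. It rests on an assumption not confirmed in this issue: that zDOTPV_BATCH computes one complex dot product per pair of vectors in a batch. The names `dotpv_loop` and `dotpv_fused` and the column-per-vector array layout are hypothetical and are not taken from the code base. The sketch only contrasts one reduction per vector with a single fused reduction over the whole batch.

```python
import numpy as np


def dotpv_loop(a, b):
    """One reduction per column, mimicking a separate dot-product call per vector.

    a, b: complex arrays of shape (n_points, n_vec), one vector per column
    (hypothetical layout, assumed for illustration only).
    """
    n_vec = a.shape[1]
    out = np.empty(n_vec, dtype=np.complex128)
    for k in range(n_vec):
        # np.vdot conjugates its first argument: sum_i conj(a[i, k]) * b[i, k]
        out[k] = np.vdot(a[:, k], b[:, k])
    return out


def dotpv_fused(a, b):
    """All columns reduced in one call: out[k] = sum_i conj(a[i, k]) * b[i, k]."""
    return np.einsum("ik,ik->k", a.conj(), b)


# Small consistency check on random complex data.
rng = np.random.default_rng(0)
n_points, n_vec = 1000, 8
a = rng.standard_normal((n_points, n_vec)) + 1j * rng.standard_normal((n_points, n_vec))
b = rng.standard_normal((n_points, n_vec)) + 1j * rng.standard_normal((n_points, n_vec))
assert np.allclose(dotpv_loop(a, b), dotpv_fused(a, b))
```

If the assumption holds, the GPU analogue of the fused form would be a single kernel (or one batched library call) that reduces every vector in the batch, rather than one launch and one device-to-host transfer per vector. Whether that is where the 25% actually goes would need to be confirmed by profiling.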