Optimize linear_solver_batch for GPU (move pack/unpack and rewrite mesh_batch_nrm2)
Move the pack/unpack operations up to linear_solver_solve_HXeY_batch. This should speed up calculations by removing unnecessary pack/unpack operations. Furthermore, the mesh_batch_nrm2 routine has been optimized: the state-by-state calls to cublas_nrm2 are replaced by a single kernel that computes the modulus squared over the whole batch, followed by a zgemv that performs the summation over grid points.
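
For reviewers, the two-step norm described above can be sketched as a small CPU reference in plain Python. This is only an illustration under assumptions, not the Octopus implementation: the function name `batch_nrm2`, the `psi[ip][ist]` layout (grid point first, state second) and the `volume_element` parameter are hypothetical, and the real code runs the first step as a GPU kernel and the second as a BLAS gemv call.

```python
import math

def batch_nrm2(psi, volume_element=1.0):
    """Return the 2-norm of every state in a batch.

    psi[ip][ist] is the complex value of state `ist` at grid point `ip`
    (an assumed layout, chosen only for this sketch).
    """
    npoints = len(psi)
    nstates = len(psi[0])

    # Step 1: one pass over the whole batch computing |psi|^2 element-wise
    # (this stands in for the single GPU kernel).
    abs2 = [[z.real * z.real + z.imag * z.imag for z in row] for row in psi]

    # Step 2: matrix-vector product abs2^T * ones, i.e. the sum over grid
    # points for all states at once (this stands in for the gemv call).
    ones = [1.0] * npoints
    norm2 = [
        sum(abs2[ip][ist] * ones[ip] for ip in range(npoints))
        for ist in range(nstates)
    ]

    # Scale by the (assumed uniform) volume element and take the square root.
    return [math.sqrt(volume_element * s) for s in norm2]

# Two grid points, two states.
psi = [[3 + 0j, 1j],
       [4j, 1 + 0j]]
norms = batch_nrm2(psi)
```

The point of the sketch is that both steps act on the whole batch, so the number of library calls no longer grows with the number of states.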
Speed up the linear_solver by reducing the number of pack/unpack operations and optimizing the calculation of the norm.
- I have checked that my code follows the Octopus coding standards
- I have added tests for all the new features added in this request.
Closes #229