Optimize linear_solver_batch for GPU (move pack/unpack and rewrite mesh_batch_nrm2)
Description
Move the pack/unpack operations up to linear_solver_solve_HXeY_batch. This should speed up calculations by removing unnecessary pack/unpack operations. Furthermore, the mesh_batch_nrm2 routine has been optimized: the state-by-state calls to cublas_nrm2 are replaced by a single kernel that computes the modulus squared over the whole batch, followed by a zgemv that performs the summation over grid points.
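The two-step norm described above can be illustrated with a minimal sketch in plain Python. This is not the Octopus implementation: the function names, the `batch[ip][ist]` layout (grid point, then state) and the vector of ones used for the summation are assumptions made for illustration only. It shows why one element-wise pass followed by a single matrix-vector product gives the same result as computing one norm per state.

```python
import math


def nrm2_per_state(batch):
    """Reference version: one independent norm per state,
    analogous to issuing one nrm2 call for each state."""
    nst = len(batch[0])
    return [
        math.sqrt(sum(abs(row[ist]) ** 2 for row in batch))
        for ist in range(nst)
    ]


def nrm2_fused(batch):
    """Hypothetical fused version, assuming batch[ip][ist] holds the
    value of state ist at grid point ip.

    Step 1: a single element-wise pass computes |psi|^2 for the whole batch.
    Step 2: a matrix-vector product with a vector of ones sums over
            grid points for all states at once (the gemv-like step).
    """
    npoints = len(batch)
    nst = len(batch[0])

    # Step 1: modulus squared of every element of the batch.
    mod2 = [[z.real * z.real + z.imag * z.imag for z in row] for row in batch]

    # Step 2: sums[ist] = sum over ip of mod2[ip][ist] * ones[ip].
    ones = [1.0] * npoints
    sums = [0.0] * nst
    for ip in range(npoints):
        for ist in range(nst):
            sums[ist] += mod2[ip][ist] * ones[ip]

    return [math.sqrt(s) for s in sums]


# Small example: 2 grid points, 2 states.
example = [
    [complex(3, 4), complex(0, 1)],
    [complex(0, 0), complex(1, 0)],
]
norms = nrm2_fused(example)
```

On a GPU the benefit comes from launching two operations for the whole batch instead of one library call per state; the sketch only demonstrates that the results agree.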
News snippet
Speed up the linear_solver by reducing the number of pack/unpack operations and optimizing the calculation of the norm.
Checklist
- I have checked that my code follows the Octopus coding standards
- I have added tests for all the new features added in this request.
Closes #229 (closed)
Edited by Martin Lueders