SELL-based SpMV

Hong Zhang requested to merge hongzh/sell-cuda into main

Testing shows that SELL outperforms CSR (cuSPARSE) on 83.5% (1385 of 1659) of the matrices from the SuiteSparse benchmark, with a mean speedup of 2.17X (up from 1.75X last year). The main features are summarized below:

  • Dynamic slice height that can be changed at runtime. The default is 16 for NVIDIA GPUs because it works well for most matrices in practice and naturally enables coalesced memory access. For extremely irregular matrices, a smaller slice height (e.g. 8 or 4) is sometimes beneficial, and the new SpMV kernels support this as well. Column padding is applied automatically to guarantee coalesced memory access; see the kernel sketch after this list.
  • A reduction strategy using warp-level primitives. The previous tree-based reduction went through global memory; the new approach works directly in registers and shared memory (see the warp-reduction sketch below).
  • CUDA event timer. This GPU-side timer lets us time the kernels more accurately, especially for small cases, and eliminates WaitForCUDA() calls in many places (see the timing example below).
  • A simple heuristic derived from extensive experiments. Multiple kernels are provided so users can fine-tune performance; by default, PETSc picks a suitable kernel based on basic properties of the matrix, such as the average and maximum number of nonzeros per row (an illustrative sketch follows below).
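
For reference, here is a minimal sketch of the SELL SpMV pattern the first bullet describes. It assumes a sliced-ELLPACK layout in which rows are grouped into slices of height `sliceHeight`, values and column indices are stored column-major within each slice and padded to the slice's maximum row length, and `slicePtr` (one entry per slice plus a terminator) gives each slice's offset. All names are illustrative, not PETSc's actual API.

```c
#include <cuda_runtime.h>

/* One thread per row. Within a slice, val/colIdx are stored column-major
   and padded to the slice's maximum row length, so at each step j the
   sliceHeight threads of a slice read consecutive addresses (coalesced).
   Padded entries hold val = 0 and a valid column index, so they add nothing. */
__global__ void sell_spmv(int nrows, int sliceHeight,
                          const int *slicePtr, const int *colIdx,
                          const double *val, const double *x, double *y)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= nrows) return;
  int slice = row / sliceHeight;                 /* slice containing this row */
  int lane  = row % sliceHeight;                 /* row's position in the slice */
  int start = slicePtr[slice];
  int width = (slicePtr[slice + 1] - start) / sliceHeight; /* padded row length */
  double sum = 0.0;
  for (int j = 0; j < width; j++) {
    int idx = start + j * sliceHeight + lane;    /* column-major within slice */
    sum += val[idx] * x[colIdx[idx]];
  }
  y[row] = sum;
}
```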
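The warp-level reduction from the second bullet can be sketched as follows: when several threads cooperate on one row, each holds a partial sum in a register, and `__shfl_down_sync` combines the partials without touching global memory. This is a generic illustration of the primitive, not the exact kernel in this merge request.

```c
/* Partial sums held in registers are combined across the warp with shuffles;
   lane 0 ends up with the total. The intra-warp part of the reduction needs
   no global (or even shared) memory traffic. */
__device__ double warp_reduce_sum(double v)
{
  for (int offset = warpSize / 2; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffffu, v, offset);
  return v;
}
```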
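CUDA event timing, as mentioned in the third bullet, looks roughly like this. `time_spmv` is a hypothetical wrapper around the kernel sketch above, and the launch configuration is arbitrary; only the stop event is waited on, instead of a device-wide barrier such as WaitForCUDA().

```c
/* Hypothetical helper: time one SpMV launch with CUDA events.
   cudaEventElapsedTime reports milliseconds. */
float time_spmv(int nrows, const int *slicePtr, const int *colIdx,
                const double *val, const double *x, double *y)
{
  cudaEvent_t start, stop;
  float ms = 0.0f;
  int blockSize = 256;
  int gridSize  = (nrows + blockSize - 1) / blockSize;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, 0);                    /* enqueued before the kernel */
  sell_spmv<<<gridSize, blockSize>>>(nrows, 16, slicePtr, colIdx, val, x, y);
  cudaEventRecord(stop, 0);                     /* enqueued after the kernel */
  cudaEventSynchronize(stop);                   /* wait for the stop event only */
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}
```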
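Finally, an entirely hypothetical version of the kind of heuristic the last bullet describes; the actual thresholds and dispatch logic in PETSc may differ.

```c
/* Illustrative thresholds only: pick how many threads cooperate on each
   row from the row-length statistics of the matrix. */
int threads_per_row(double avg_nnz, int max_nnz)
{
  if (max_nnz > 8 * avg_nnz) return 32;  /* very irregular: a full warp per row */
  if (avg_nnz > 32.0)        return 8;   /* long rows: several threads per row */
  return 1;                              /* short, regular rows: one thread each */
}
```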