WIP: GPU: Hybrid vector and matrix types via libaxb, replacing CUDA and ViennaCL in the long run

Karl Rupp requested to merge karlrupp/feature-libaxb-introduction-2 into master

This is a rebase of MR !2073 (closed) onto the current master. For the sake of completeness, the description is replicated below and extended where appropriate:

The 'GPU wrapper' is designed to let the PETSc team's development effort scale in the human dimension: instead of largely reinventing the wheel for each of {CUDA, OpenCL, SYCL, HIP, etc.}, it wraps the low-level details of the hybrid architecture in a separate library (I named it libaxb as an homage to Ax=b). The style of libaxb is very similar to PETSc to allow for seamless integration.

libaxb doesn't deal with MPI, because the hybrid architectures are a feature of the node (note: things like CUDA-aware MPI can still be optimized within PETSc if needed, without sacrificing the abstraction). Thus, the MPI handling is done once in PETSc for libaxb, and any other node-level library can then be plugged in as needed (contrast this with the current approach, where the CUDA backend would have to be replicated for every other GPU platform). As such, libaxb is (hopefully) of interest to a non-PETSc audience as well, enabling contributions to libaxb (and thus to PETSc) from non-PETSc users.

Important Concept 1: Splitting "memory" and "operations"

libaxb assigns to each entity a memory backend and an operations backend. The currently supported "memory" backends are {host, CUDA, OpenCL}, referring to the runtime that manages the memory buffer(s). The "operations" backends provide the set of operations available on data that resides in one or more "memory" backends.

Examples of combining memory backends with operations backends:

  • With memory backend "CUDA", one can select at run-time from different operations backends like 'CUDA' (native cuda kernels), 'CUBLAS', 'CUSPARSE', 'MAGMA', etc. This way one can mix-and-match, selecting the most suitable operations from each library.
  • With memory backend "OpenCL", one can use "OpenCL" (a set of native kernels from OpenCL) or "clSparse"/"rocSparse" (tuned for AMD GPUs), etc.
  • With memory backend "host" one can combine MKL or OpenMP-accelerated routines via operations backends.
  • Because the backends are selected via runtime flags, a user can quickly switch between them by passing the appropriate flags (see Usage below; a concrete example follows this list). I also envision that libaxb can auto-tune itself: given a memory backend, libaxb could automatically compare the different operations backends 'online' and keep the fastest configuration for subsequent runs. Use case: a user needs to run many multigrid solves. In one 'training' run, libaxb runs each operation once for each operations backend available for the selected memory backend. All subsequent runs then use, e.g., the fastest SpMV available at each multigrid level (e.g. cuSPARSE on the finest level, libaxb-native on the second-finest, etc.).
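
For instance, with a hypothetical executable ./app and illustrative backend names (the actual option names are listed under Usage below), the same binary could be run as

  ./app -vec_type axb -axb_memory cuda -axb_ops cublas
  ./app -vec_type axb -axb_memory cuda -axb_ops cusparse

to switch operations backends without recompiling.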

Important Concept 2: Data types are runtime parameters

With mixed precision repeatedly being considered an interesting path to explore, libaxb doesn't hard-wire integer or floating-point data types. Thus, whenever data goes into or out of a libaxb object, the respective data type needs to be specified. This makes the API a bit more cumbersome to use, but it opens up new mixed-precision options in PETSc.
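
To illustrate the idea, here is a minimal, self-contained sketch; all names below are hypothetical and do not correspond to the actual libaxb API. The point is only the pattern: every data transfer carries an explicit type tag, and the object converts to whatever storage precision was chosen at run time.

  /* Illustrative sketch only: these names are hypothetical and do NOT reflect the actual libaxb API. */
  #include <stdio.h>
  #include <stdlib.h>

  typedef enum { EX_REAL_FLOAT, EX_REAL_DOUBLE } ExDataType;

  typedef struct {
    void      *data;   /* opaque buffer, owned by the memory backend */
    ExDataType dtype;  /* storage precision, chosen at run time      */
    size_t     n;
  } ExVec;

  /* The caller always states the type of the incoming buffer; the object
     converts to whatever precision it was configured with at run time.  */
  static void ExVecSetValues(ExVec *v, const void *values, ExDataType value_type)
  {
    if (v->dtype == EX_REAL_FLOAT && value_type == EX_REAL_DOUBLE) {
      const double *src = (const double *)values;
      float        *dst = (float *)v->data;
      for (size_t i = 0; i < v->n; ++i) dst[i] = (float)src[i];  /* down-convert on entry */
    }
    /* ... remaining type combinations elided ... */
  }

  int main(void)
  {
    double host_data[4] = {1.0, 2.0, 3.0, 4.0};
    ExVec  v = { malloc(4 * sizeof(float)), EX_REAL_FLOAT, 4 };  /* run-time choice: float storage */
    ExVecSetValues(&v, host_data, EX_REAL_DOUBLE);               /* data enters as double          */
    printf("v[2] = %f\n", (double)((float *)v.data)[2]);
    free(v.data);
    return 0;
  }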

Usage

Configure with --download-libaxb. To make the CUDA backend in libaxb available, additionally configure PETSc with --with-cuda; similarly for OpenCL.
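
For example:

  ./configure --download-libaxb --with-cuda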

The runtime options are similar to those for the existing GPU backends:

  • -vec_type axb to select a hybrid vector
  • -mat_type aijaxb to select a hybrid AIJ matrix

Additionally:

  • -axb_view to show available and selected backends
  • -axb_memory to select the memory backend for hybrid types
  • -axb_ops to select the operations backend for hybrid types
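
For orientation, here is a minimal sketch of how these options would typically take effect in application code (standard PETSc calls only, error checking omitted; it is assumed that the axb types are activated purely through the options database, as with the existing GPU backends):

  /* Backend-agnostic application code: vector and matrix types as well as the
     libaxb memory/operations backends are chosen via the runtime options above,
     e.g. -vec_type axb -mat_type aijaxb -axb_memory cuda -axb_ops cusparse -axb_view
     (the backend names in this example line are illustrative).                  */
  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Vec      x, b;
    Mat      A;
    PetscInt n = 100;

    PetscInitialize(&argc, &argv, NULL, NULL);

    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, n);
    VecSetFromOptions(x);                 /* picks up -vec_type axb    */
    VecDuplicate(x, &b);

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);                 /* picks up -mat_type aijaxb */
    MatSetUp(A);

    /* ... assemble A and b, then solve; all Vec/Mat operations dispatch to the
       memory/operations backends selected at run time ...                      */

    MatDestroy(&A);
    VecDestroy(&b);
    VecDestroy(&x);
    PetscFinalize();
    return 0;
  }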

Implications on CUDA and ViennaCL bindings

The new functionality in this MR should (sooner rather than later) fully replace the CUDA and ViennaCL types. I'll carefully make sure that existing functionality is not lost.

Open TODOs

  • Follow-up commit introducing the matrix type AIJAXB. The first commit in this MR just carries over the older MR !2073 (closed).

Current Limitations/Shortcuts

The following temporary shortcuts have been taken to get this MR out the door:

  • Currently, only one operations backend per memory backend is implemented in libaxb. Additional backends can be added easily (within a few days!) and will be provided over the coming weeks (as time permits).

  • Only 32-bit integers and double-precision scalars are supported. This will be relaxed considerably over the coming weeks and months and requires barely any modifications to the PETSc bindings. The API for switching integer and floating-point precision is already available.

  • Support for streams (CUDA) and their equivalents in OpenCL and other runtimes isn't exposed in the libaxb API just yet and will be added later. This MR is already large, and making it even bigger would only make it harder to get in.

Broader Impact

  • One can specify the desired precision at run time. Some discussion is needed on how to provide appropriate options and on their implications. For example, several places reuse the system matrix A within the preconditioner P; however, what if the preconditioner matrix should be of lower precision?

  • The ability to select from different compute kernels in libaxb requires additional runtime options in PETSc. For example, a user might want to use two different kernels for sparse matrix-vector products depending on the sparsity of the matrix. @rtmills suggested something like -mat_type aijaxb -axb_mat_ops matvec:ginkgo,default:cusparse to select the SpMV kernel from Ginkgo while taking all other kernels from cuSPARSE.
