
add CUDA backend

Ryan Curtin requested to merge rcurtin/bandicoot-code:cuda-backend into unstable

I wanted to breathe some life back into this effort because I think it's really quite important and has a lot of potential, and I only wish I'd had more time for it before now. As far as I can tell, there is still no solution for GPU linear algebra programming in C++ that I would call robust or easy to use.

In any case, one of the conclusions we came to in the past was that OpenCL double-precision floating-point performance was really not that great on the devices we tested. But GPU double-precision performance is clearly good enough for the machine learning community, which uses it extensively via TensorFlow and other machine learning frameworks. In that community, nVidia is almost completely dominant, thanks in part to their efforts on cuBLAS, cuDNN, and the other related CUDA libraries.

After some basic experimentation a few months ago, I found that clMAGMA (and other OpenCL implementations) significantly underperform equivalent CUDA implementations. At that point, I started refactoring bandicoot so that it could support either a CUDA or OpenCL backend, depending on the user's preference. I hadn't said anything about it until now because I wanted to be sure that the benchmarking results justified the strategy. I'll discuss that more below but I hope you will agree with me that it very much does.

Now with all that introduced, this (quite large) MR makes the following general changes:

  • coot_rt_t is now just a wrapper for two separate runtimes: coot::opencl::runtime_t and coot::cuda::runtime_t. A coot_rt_t is available as a singleton, accessible via coot::get_rt().

  • Whether CUDA (or OpenCL) support is available depends on the status of COOT_USE_OPENCL and COOT_USE_CUDA in config.hpp.

  • The backend can be set via coot_rt().backend (e.g. coot_rt().backend = CL_BACKEND or coot_rt().backend = CUDA_BACKEND). For now this needs to be done with care...

  • The device memory for a coot::Mat is held as a union of the CUDA pointer (eT*) and the OpenCL pointer (void*).

  • All GPU-specific work goes through an interface provided by coot_rt_t, which, depending on the value of get_rt().backend, in turn calls out to functionality in opencl/ or cuda/ (see the sketch just after this list).

  • All operations for either backend are tested in tests/. The current operations that are known to work are: accu(), +/-/*/% on matrices, chol(), matrix multiplication (gemm), matrix-vector multiplication (gemv), sum(), trace(), and various elementwise operations that were already implemented for OpenCL.

  • Really, really awful RNG support was added for the tests.
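
To make the structure concrete, here is a rough sketch of how backend selection and dispatch fit together from the user's point of view; the user-facing names below (fmat, fill(), and so on) are illustrative rather than a promise about the current API:

    // rough sketch of backend selection + dispatch; user-facing names here are illustrative
    #include <bandicoot>

    using namespace coot;

    int main()
      {
      // choose the backend up front; switching later needs care (see point 1 at the end)
      get_rt().backend = CUDA_BACKEND;   // or CL_BACKEND for the OpenCL path

      fmat A(1000, 1000);
      A.fill(1.0f);

      // calls like accu() never mention the backend; internally they go through the
      // coot_rt_t interface, which forwards to coot::cuda::* or coot::opencl::*
      // depending on get_rt().backend
      const float s = accu(A);   // 1000 * 1000 elements, all ones -> 1e6

      return (s == 1000000.0f) ? 0 : 1;
      }

The important point is that user code never touches opencl/ or cuda/ directly; everything funnels through the coot_rt_t interface.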

In order to validate the approach, I added benchmarking scripts in benchmarks/ and tested several operations on some nVidia GPUs that I have lying around: an RTX2080ti (that I ostensibly bought for this effort but actually use to play games across three 4K monitors when I get the chance :)), a GTX1080ti, and a GTX980. I ran benchmarks using three backends (the CPU via Armadillo, OpenCL, and CUDA), each with two element types (float and double).

Proceeding in alphabetical order... we'll start with accu().

[benchmark plot: accu]

The kernel used for CUDA is a direct transliteration of the one used for OpenCL. OpenCL runs (orange) are noticeably slower than CUDA (green), with the difference being an order of magnitude or more at small sizes. At larger sizes the gap shrinks but is still noticeable: on the RTX2080ti with 1B elements, for example, OpenCL takes about 1.46 seconds and CUDA takes 0.78 seconds, which is still nearly a 2x margin. I was surprised to see that.

Now, it's worth pointing out that the accu() kernels being used here aren't necessarily great and could definitely be improved. In fact, Armadillo on the CPU was generally able to outperform both CUDA and OpenCL on each of the 3 systems I used. :)
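
For context, the kernel pattern being compared on both backends is a standard block-wise tree reduction; a minimal CUDA sketch of that pattern (not bandicoot's actual kernel) is:

    // minimal block-wise tree reduction, just to show the kernel pattern being compared;
    // bandicoot's actual accu kernels differ in the details
    template<typename eT>
    __global__ void accu_chunk(const eT* mem, const unsigned int n_elem, eT* out)
      {
      extern __shared__ unsigned char aux_mem_raw[];
      eT* aux_mem = reinterpret_cast<eT*>(aux_mem_raw);

      const unsigned int tid = threadIdx.x;
      const unsigned int i   = blockIdx.x * blockDim.x + tid;

      // each thread loads one element (or 0 past the end) into shared memory
      aux_mem[tid] = (i < n_elem) ? mem[i] : eT(0);
      __syncthreads();

      // tree reduction within the block; assumes blockDim.x is a power of two
      for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
        if (tid < s) { aux_mem[tid] += aux_mem[tid + s]; }
        __syncthreads();
        }

      // one partial sum per block; a second pass (or a host-side sum) finishes the job
      if (tid == 0) { out[blockIdx.x] = aux_mem[0]; }
      }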

Next, chol():

[benchmark plot: chol]

The Cholesky decomposition is implemented via clMAGMA for OpenCL, and via cuSolverDn for CUDA. clMAGMA seems to perform well with small matrix sizes, but float performance for large matrices is heavily in cuSolverDn's favor.
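
For reference, the cuSolverDn path follows the usual workspace-query-then-factorize pattern; the sketch below shows the float / lower-triangular case with error checking omitted, and isn't necessarily identical to the backend's code:

    // sketch of the cuSolverDn Cholesky path (float, lower triangular); error checks omitted
    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    void chol_lower(float* A, const int n)   // A: n x n, column-major, in device memory
      {
      cusolverDnHandle_t handle;
      cusolverDnCreate(&handle);

      // query the workspace size, allocate it, then factorize in place
      int lwork = 0;
      cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, A, n, &lwork);

      float* work = NULL;
      int*   info = NULL;
      cudaMalloc((void**) &work, sizeof(float) * lwork);
      cudaMalloc((void**) &info, sizeof(int));

      cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, A, n, work, lwork, info);

      cudaFree(work);
      cudaFree(info);
      cusolverDnDestroy(handle);
      }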

Next, a full matrix copy:

[benchmark plot: copy]

At a low level, this is using cudaMemcpy() for CUDA and clEnqueueWriteBuffer() for OpenCL. It is again somewhat surprising to me that the CUDA implementation is roughly 25% faster.

Next, filling a matrix (that previously contained random values) with zeros:

[benchmark plot: fill]

Performance is roughly equivalent for large matrix sizes, but there are some strange artifacts at medium sizes.

Next, matrix multiplication for square matrices.

[benchmark plot: matmul]

OpenCL uses clBLAS while CUDA uses cuBLAS. cuBLAS has a pretty big margin for float-valued matrices, but not for doubles (only a slight edge there). Benchmarks for when the first or second operand is transposed look virtually identical.
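
On the CUDA side this boils down to a single cublasSgemm / cublasDgemm call per multiplication. A sketch of the float case (column-major, device pointers, cuBLAS handle assumed to be created elsewhere):

    // sketch of the cuBLAS side of C = A * B for square n x n float matrices
    #include <cublas_v2.h>

    void matmul(cublasHandle_t handle, const float* A, const float* B, float* C, const int n)
      {
      const float alpha = 1.0f;
      const float beta  = 0.0f;

      // transposed variants just swap CUBLAS_OP_N for CUBLAS_OP_T on the relevant operand
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n,
                  &alpha, A, n, B, n,
                  &beta,  C, n);
      }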

Next, matrix-vector multiplication.

[benchmark plot: matvec]

It was surprising to see OpenCL perform well here; I think this is the only case where it won somewhat resoundingly. I wonder if I am using cublasSgemv and cublasDgemv incorrectly, but I have not investigated that yet.
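
For reference, the standard call pattern for y = A*x with cuBLAS looks like the sketch below (typical usage, not necessarily exactly what the backend currently does); if I'm holding cublasSgemv wrong, the transpose flag, leading dimension, or increments would be the first things to check.

    // standard usage pattern for y = A*x with cuBLAS (float, column-major, device pointers)
    #include <cublas_v2.h>

    void matvec(cublasHandle_t handle, const float* A, const float* x, float* y,
                const int n_rows, const int n_cols)
      {
      const float alpha = 1.0f;
      const float beta  = 0.0f;

      // lda is the number of rows of A as stored; incx/incy are the strides of x and y
      cublasSgemv(handle, CUBLAS_OP_N,
                  n_rows, n_cols,
                  &alpha, A, n_rows,
                  x, 1,
                  &beta, y, 1);
      }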

The last task I benchmarked was submatrix extraction.

[benchmark plot: submat_copy]

Performance is roughly equivalent, but skewed slightly in CUDA's favor.
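
For context, submatrix extraction is just a strided 2-D copy. One natural way to do it on the CUDA side (not necessarily what the backend currently does) is a single cudaMemcpy2D call:

    // sketch: extract an (n_rows_sub x n_cols_sub) block starting at (start_row, start_col)
    // from a column-major n_rows x n_cols matrix, entirely in device memory
    #include <cuda_runtime.h>

    void extract_submat(float* dest, const float* src,
                        const size_t n_rows,      // rows of the full matrix
                        const size_t start_row, const size_t start_col,
                        const size_t n_rows_sub, const size_t n_cols_sub)
      {
      const float* src_block = src + start_col * n_rows + start_row;

      // pitch = distance in bytes between the starts of consecutive columns
      cudaMemcpy2D(dest,      n_rows_sub * sizeof(float),   // dest pitch (packed)
                   src_block, n_rows     * sizeof(float),   // src pitch (full matrix)
                   n_rows_sub * sizeof(float),              // bytes to copy per column
                   n_cols_sub,                              // number of columns
                   cudaMemcpyDeviceToDevice);
      }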


Overall, after that benchmarking, my conclusion is this: CUDA provides a significant enough speed gain on nVidia devices that it (and its ecosystem) can't be ignored for this project, despite the ugliness of much CUDA code and the general unfriendliness of nVidia towards open-source efforts. Even for equivalent kernels, CUDA provides a significant and noticeable speedup (and I see the same behavior informally in the other cases where I have transliterated the OpenCL kernels).

Some of the really nice things about supporting multiple backends are that

(a) it would not be too hard to add other backends if they are eventually needed (perhaps some of the SYCL efforts, or weirder things like Xilinx's triSYCL, depending on how all of that plays out)

(b) there's no need for us to claim that OpenCL is competitive with or faster than any particular toolkit... we just wrap each one and the user chooses which to use

There are a number of questions and issues I intentionally did not address in this MR; I figured we should discuss them first and work out the best strategy.

  1. The code currently allows switching backends willy-nilly. That's kind of a disaster waiting to happen. There are a number of options: allow each individual Mat to be associated with a particular backend; never allow changing backends; allow changing backends only through a special change_backend() function; and so on...

  2. A wrapper library may be useful due to the massive number of dependencies brought in by the CUDA ecosystem (see the Makefile in tests/...). This is also an issue, to a lesser extent, with OpenCL. I think the same wrapper strategy that Armadillo uses could work here.

  3. Low-precision support (fp16 or lower) could be really useful, as it has been found to work well in deep learning applications and also gives some additional speedup.

  4. I haven't considered multi-device support. It might be interesting for another day.

  5. I haven't optimized most of the kernels and implementations. I got them working well enough and left the OpenCL implementations as-is.

  6. I haven't really considered or tested thread-safety for the CUDA backend (or the OpenCL backend). I think it works but I am not sure.

Anyway, I'm sure I've written more than enough words now about this MR (probably more words than there are code changes... not sure). It's been about a month of consistent effort, and I can see that the library is not too far from a plausible release: we just need to wrap more functionality, provide some benchmarks and documentation, see if it works with ensmallen (a nice advertisement for both), and work out a few more details. I often get asked, "when will Bandicoot be ready?" and given the results I've seen through this refactoring, I'm excited about the possibility of being able to say "now" sometime in the near future. :)
