Initial CUDA+OpenCL SVD implementation + MAGMA cleanup (!41) · Merge requests · bandicoot-lib / bandicoot-code

Ryan Curtin requested to merge rcurtin/bandicoot-code:svd into unstable Feb 16, 2023

One of the matrix norms needs the singular values of a matrix, so off down a rabbit hole I went... two weeks later, here's some code. I'm really happy I got anything at all working here; the decompositions have seemed intimidating for years.

Based on the existing MAGMA code that Conrad ported for chol(), I looked into the MAGMA sources (now version 2.7) and realized that I'd have to port over a whole pile of stuff for svd(). Fortunately that should mostly be the same stuff that's needed for eig_sym(), lu() and qr() and related decompositions. So the changeset is huge but the rough summary points are:

All MAGMA ports are now in bandicoot_bits/opencl/magma/. I added a README that discusses how I did the ports too, in case someone else wants to do more later. (They're not fun.)
All ported MAGMA functions also have ported tests in tests/magma/. Compiling them requires a working FORTRAN compiler, and that directory has its own Makefile. Those tests aren't built by default.
svd()'s OpenCL backend uses magma_dgesvd(), which is a hybrid CPU/GPU implementation. Unfortunately the first step is to copy the matrix back from GPU to CPU, and then call magma_dgesvd()... it's possible that this could be improved, but no need right now.
svd()'s CUDA backend uses cuSolverDn. (That one was a lot easier...)
I have noticed some flakiness with the OpenCL implementation. I am not yet sure if it is the nvidia OpenCL driver, or some other bug in the code, or what, but once I figure out how to reproduce it I'll be able to narrow it down.
Unused MAGMA functionality (from clMAGMA 1.3) has been removed---that functionality was just provided through newer versions of MAGMA that were ported too.
Early attempts at handwriting LAPACK functions with custom OpenCL kernels have been thrown away. That was a bad idea.

nvidia OpenCL driver issues keep me from running this successfully on the system with an RTX2080ti, but here are the benchmarks for 2500x2500 matrices on an old GTX 980:

Computing only singular values, no U or V^T

svd_no_uv, gtx980, cpu, float, 2500, 2500, 0, 10.9703
svd_no_uv, gtx980, cpu, float, 2500, 2500, 1, 10.9641
svd_no_uv, gtx980, cpu, float, 2500, 2500, 2, 10.97
svd_no_uv, gtx980, cpu, float, 2500, 2500, 3, 10.9762
svd_no_uv, gtx980, cpu, float, 2500, 2500, 4, 10.9581
svd_no_uv, gtx980, opencl, float, 2500, 2500, 0, 1.18189
svd_no_uv, gtx980, opencl, float, 2500, 2500, 1, 1.12752
svd_no_uv, gtx980, opencl, float, 2500, 2500, 2, 1.1258
svd_no_uv, gtx980, opencl, float, 2500, 2500, 3, 1.13067
svd_no_uv, gtx980, opencl, float, 2500, 2500, 4, 1.1264
svd_no_uv, gtx980, cuda, float, 2500, 2500, 0, 0.558766
svd_no_uv, gtx980, cuda, float, 2500, 2500, 1, 0.517043
svd_no_uv, gtx980, cuda, float, 2500, 2500, 2, 0.517197
svd_no_uv, gtx980, cuda, float, 2500, 2500, 3, 0.517078
svd_no_uv, gtx980, cuda, float, 2500, 2500, 4, 0.516812
svd_no_uv, gtx980, cpu, double, 2500, 2500, 0, 13.1234
svd_no_uv, gtx980, cpu, double, 2500, 2500, 1, 13.1125
svd_no_uv, gtx980, cpu, double, 2500, 2500, 2, 13.1135
svd_no_uv, gtx980, cpu, double, 2500, 2500, 3, 13.1256
svd_no_uv, gtx980, cpu, double, 2500, 2500, 4, 13.1167
svd_no_uv, gtx980, opencl, double, 2500, 2500, 0, 1.56305
svd_no_uv, gtx980, opencl, double, 2500, 2500, 1, 1.51827
svd_no_uv, gtx980, opencl, double, 2500, 2500, 2, 1.52242
svd_no_uv, gtx980, opencl, double, 2500, 2500, 3, 1.52261
svd_no_uv, gtx980, opencl, double, 2500, 2500, 4, 1.51982
svd_no_uv, gtx980, cuda, double, 2500, 2500, 0, 0.946693
svd_no_uv, gtx980, cuda, double, 2500, 2500, 1, 0.947286
svd_no_uv, gtx980, cuda, double, 2500, 2500, 2, 0.946691
svd_no_uv, gtx980, cuda, double, 2500, 2500, 3, 0.946494
svd_no_uv, gtx980, cuda, double, 2500, 2500, 4, 0.947212

Not too far off CUDA's performance. I'm reasonably happy with that.

Computing full SVD

svd_full, gtx980, cpu, float, 2500, 2500, 0, 32.9066
svd_full, gtx980, cpu, float, 2500, 2500, 1, 32.9683
svd_full, gtx980, cpu, float, 2500, 2500, 2, 32.9469
svd_full, gtx980, cpu, float, 2500, 2500, 3, 32.9515
svd_full, gtx980, cpu, float, 2500, 2500, 4, 32.9101
svd_full, gtx980, opencl, float, 2500, 2500, 0, 31.8677
svd_full, gtx980, opencl, float, 2500, 2500, 1, 31.6918
svd_full, gtx980, opencl, float, 2500, 2500, 2, 31.404
svd_full, gtx980, opencl, float, 2500, 2500, 3, 31.6763
svd_full, gtx980, opencl, float, 2500, 2500, 4, 31.5731
svd_full, gtx980, cuda, float, 2500, 2500, 0, 11.634
svd_full, gtx980, cuda, float, 2500, 2500, 1, 11.6638
svd_full, gtx980, cuda, float, 2500, 2500, 2, 11.6971
svd_full, gtx980, cuda, float, 2500, 2500, 3, 11.636
svd_full, gtx980, cuda, float, 2500, 2500, 4, 11.6462
svd_full, gtx980, cpu, double, 2500, 2500, 0, 40.2789
svd_full, gtx980, cpu, double, 2500, 2500, 1, 40.2107
svd_full, gtx980, cpu, double, 2500, 2500, 2, 40.2284
svd_full, gtx980, cpu, double, 2500, 2500, 3, 40.1234
svd_full, gtx980, cpu, double, 2500, 2500, 4, 40.178
svd_full, gtx980, opencl, double, 2500, 2500, 0, 45.4243
svd_full, gtx980, opencl, double, 2500, 2500, 1, 45.1055
svd_full, gtx980, opencl, double, 2500, 2500, 2, 45.0162
svd_full, gtx980, opencl, double, 2500, 2500, 3, 45.2775
svd_full, gtx980, opencl, double, 2500, 2500, 4, 45.0502
svd_full, gtx980, cuda, double, 2500, 2500, 0, 10.7714
svd_full, gtx980, cuda, double, 2500, 2500, 1, 10.7895
svd_full, gtx980, cuda, double, 2500, 2500, 2, 10.8657
svd_full, gtx980, cuda, double, 2500, 2500, 3, 10.8852
svd_full, gtx980, cuda, double, 2500, 2500, 4, 10.8549

Much less great here, but I wonder if it would be better with a more powerful GPU. I'll find out someday, but at least to me, the more important thing for now is that we have SVD support at all.

If anyone else wants to run the benchmarks, just make svd in the benchmarks/ directory and then run... e.g., ./svd gtx980 5 2500 2500 svd-benchmarks.csv is the command I ran.

I'll leave this for a few days and merge after that, in case anyone has any comments. Next steps are finishing norm() using this support, then qr(), lu(), and eig_sym().

Initial CUDA+OpenCL SVD implementation + MAGMA cleanup

Merge request reports