LU factorisation (!74) · Merge requests · bandicoot-lib / bandicoot-code

Ryan Curtin requested to merge rcurtin/bandicoot-code:lu into unstable May 17, 2023

This ports the MAGMA dgetrf_gpu() and sgetrf_gpu() functions to Bandicoot's OpenCL backend, and uses the cuSolver Xgetrf() functions to provide an LU factorisation, accessed via the lu() function.

The behavior is meant to match Armadillo closely: one version will provide the permutation matrix P also, and the other will return P.t() * L.

There was a long hold-up here, as I first ported the MAGMA version that does not use pivoting (dgetrf_nopiv_gpu()), before finding documentation in a paper about the LU decomposition that non-pivoting versions tend to be very unsafe. I wish I had found that paper first...

Speedups of the LU factorisation are less dramatic than, e.g., matrix multiplication, but there is still a speedup for sufficiently large matrices. In addition, the cuSolver implementation is faster (at least on my RTX2080ti) than OpenCL by a decent margin.

Here are benchmarks for 5 trials of the LU decomposition on a 5000x5000 square matrix (this returns P.t() * L):

lu, rtx2080ti, cpu, float, 5000, 5000, 0, 0.599317
lu, rtx2080ti, cpu, float, 5000, 5000, 1, 0.590032
lu, rtx2080ti, cpu, float, 5000, 5000, 2, 0.591821
lu, rtx2080ti, cpu, float, 5000, 5000, 3, 0.644595
lu, rtx2080ti, cpu, float, 5000, 5000, 4, 0.613173

lu, rtx2080ti, opencl, float, 5000, 5000, 0, 0.448067
lu, rtx2080ti, opencl, float, 5000, 5000, 1, 0.328708
lu, rtx2080ti, opencl, float, 5000, 5000, 2, 0.841338
lu, rtx2080ti, opencl, float, 5000, 5000, 3, 0.337469
lu, rtx2080ti, opencl, float, 5000, 5000, 4, 0.350125

lu, rtx2080ti, cuda, float, 5000, 5000, 0, 0.0549754
lu, rtx2080ti, cuda, float, 5000, 5000, 1, 0.0583679
lu, rtx2080ti, cuda, float, 5000, 5000, 2, 0.0562912
lu, rtx2080ti, cuda, float, 5000, 5000, 3, 0.0526977
lu, rtx2080ti, cuda, float, 5000, 5000, 4, 0.0536727

lu, rtx2080ti, cpu, double, 5000, 5000, 0, 0.812786
lu, rtx2080ti, cpu, double, 5000, 5000, 1, 0.851364
lu, rtx2080ti, cpu, double, 5000, 5000, 2, 0.849764
lu, rtx2080ti, cpu, double, 5000, 5000, 3, 0.825078
lu, rtx2080ti, cpu, double, 5000, 5000, 4, 0.897321

lu, rtx2080ti, opencl, double, 5000, 5000, 0, 0.479671
lu, rtx2080ti, opencl, double, 5000, 5000, 1, 0.391531
lu, rtx2080ti, opencl, double, 5000, 5000, 2, 0.345917
lu, rtx2080ti, opencl, double, 5000, 5000, 3, 0.347555
lu, rtx2080ti, opencl, double, 5000, 5000, 4, 0.322029

lu, rtx2080ti, cuda, double, 5000, 5000, 0, 0.26085
lu, rtx2080ti, cuda, double, 5000, 5000, 1, 0.26179
lu, rtx2080ti, cuda, double, 5000, 5000, 2, 0.266077
lu, rtx2080ti, cuda, double, 5000, 5000, 3, 0.280755
lu, rtx2080ti, cuda, double, 5000, 5000, 4, 0.265287

Note that the first OpenCL run may have a little bit of kernel compilation time from inside clBLAS.

For the LUP decomposition the results are basically the same:

lup, rtx2080ti, cpu, float, 5000, 5000, 0, 0.2877
lup, rtx2080ti, cpu, float, 5000, 5000, 1, 0.297605
lup, rtx2080ti, cpu, float, 5000, 5000, 2, 0.278946
lup, rtx2080ti, cpu, float, 5000, 5000, 3, 0.243984
lup, rtx2080ti, cpu, float, 5000, 5000, 4, 0.22889

lup, rtx2080ti, opencl, float, 5000, 5000, 0, 0.365978
lup, rtx2080ti, opencl, float, 5000, 5000, 1, 0.426145
lup, rtx2080ti, opencl, float, 5000, 5000, 2, 0.31194
lup, rtx2080ti, opencl, float, 5000, 5000, 3, 0.359475
lup, rtx2080ti, opencl, float, 5000, 5000, 4, 0.345331

lup, rtx2080ti, cuda, float, 5000, 5000, 0, 0.0503121
lup, rtx2080ti, cuda, float, 5000, 5000, 1, 0.0513349
lup, rtx2080ti, cuda, float, 5000, 5000, 2, 0.0517719
lup, rtx2080ti, cuda, float, 5000, 5000, 3, 0.0510887
lup, rtx2080ti, cuda, float, 5000, 5000, 4, 0.0555293

lup, rtx2080ti, cpu, double, 5000, 5000, 0, 0.5166
lup, rtx2080ti, cpu, double, 5000, 5000, 1, 0.515409
lup, rtx2080ti, cpu, double, 5000, 5000, 2, 0.469594
lup, rtx2080ti, cpu, double, 5000, 5000, 3, 0.543063
lup, rtx2080ti, cpu, double, 5000, 5000, 4, 0.536805

lup, rtx2080ti, opencl, double, 5000, 5000, 0, 0.416509
lup, rtx2080ti, opencl, double, 5000, 5000, 1, 0.397334
lup, rtx2080ti, opencl, double, 5000, 5000, 2, 0.376364
lup, rtx2080ti, opencl, double, 5000, 5000, 3, 0.38218
lup, rtx2080ti, opencl, double, 5000, 5000, 4, 0.32946

lup, rtx2080ti, cuda, double, 5000, 5000, 0, 0.261128
lup, rtx2080ti, cuda, double, 5000, 5000, 1, 0.259326
lup, rtx2080ti, cuda, double, 5000, 5000, 2, 0.26427
lup, rtx2080ti, cuda, double, 5000, 5000, 3, 0.267787
lup, rtx2080ti, cuda, double, 5000, 5000, 4, 0.261496

Note that the CPU is slower for the LU version by a decent margin---this is probably just the cost of computing P.t() * L (which for the GPU, due to how I wrote the kernels to extract L and U from the results of getrf(), doesn't really cost much extra at all).

Anyway, the speeds pretty much equalize as the size of the matrix gets down to 500x500, and by 50x50, the CPU is an order of magnitude faster. (That's not particularly surprising.)

If you want to run the benchmarks yourself, see benchmarks/lu.cpp.

LU factorisation

Merge request reports