`eig_sym()`

Ryan Curtin requested to merge rcurtin/bandicoot-code:eig_sym into unstable

Surprisingly, this didn't take as long as I thought it would. I added eig_sym() by adapting MAGMA's dsyevd() and ssyevd() for the OpenCL backend, and by using cuSolverDn's Xsyevd() for the CUDA backend.

It ends up being a lot of code, though:

  • Ported several MAGMA functions, including MAGMABLAS utilities.
  • Added tests for ported MAGMA functions (themselves also ports of MAGMA tests).
  • Added eig_sym() backend functions for both OpenCL and CUDA.
  • Handled an OpenCL bug on some NVIDIA systems where subgroup support is (strangely) not available.
  • Added test suite and benchmarks for eigendecompositions.
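For readers who want a feel for what an eigendecomposition test has to verify, here is a minimal CPU-only sketch of a residual check: for a symmetric A with eigenvalues λ and eigenvectors V, every entry of A·V − V·diag(λ) should be near zero. To keep it self-contained, it uses plain cyclic Jacobi rotations as a stand-in solver; that is not the divide-and-conquer approach of syevd, and none of these names (`Matrix`, `EigResult`, `jacobi_eig_sym`, `max_residual`) come from bandicoot or MAGMA. They are hypothetical helpers for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Dense row-major square matrix (hypothetical helper type for this sketch).
using Matrix = std::vector<std::vector<double>>;

struct EigResult
{
  std::vector<double> eigval;  // sorted ascending
  Matrix eigvec;               // column j pairs with eigval[j]
};

// Cyclic Jacobi eigensolver for a symmetric matrix (stand-in, not syevd).
EigResult jacobi_eig_sym(Matrix A, const int max_sweeps = 50, const double tol = 1e-12)
{
  const std::size_t n = A.size();
  for (const auto& row : A) { assert(row.size() == n); }

  Matrix V(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i) { V[i][i] = 1.0; }

  for (int sweep = 0; sweep < max_sweeps; ++sweep)
  {
    // Stop once the off-diagonal part is negligible.
    double off = 0.0;
    for (std::size_t p = 0; p < n; ++p)
      for (std::size_t q = p + 1; q < n; ++q)
        off += A[p][q] * A[p][q];
    if (std::sqrt(off) < tol) { break; }

    for (std::size_t p = 0; p < n; ++p)
      for (std::size_t q = p + 1; q < n; ++q)
      {
        if (A[p][q] == 0.0) { continue; }
        // Rotation (c, s) chosen so that the (p, q) entry becomes zero.
        const double theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q]);
        const double t = (theta >= 0.0 ? 1.0 : -1.0)
                       / (std::abs(theta) + std::sqrt(theta * theta + 1.0));
        const double c = 1.0 / std::sqrt(t * t + 1.0);
        const double s = t * c;

        for (std::size_t k = 0; k < n; ++k)  // A <- A * J (columns p, q)
        {
          const double akp = A[k][p], akq = A[k][q];
          A[k][p] = c * akp - s * akq;
          A[k][q] = s * akp + c * akq;
        }
        for (std::size_t k = 0; k < n; ++k)  // A <- J^T * A (rows p, q)
        {
          const double apk = A[p][k], aqk = A[q][k];
          A[p][k] = c * apk - s * aqk;
          A[q][k] = s * apk + c * aqk;
        }
        for (std::size_t k = 0; k < n; ++k)  // V <- V * J (accumulate eigenvectors)
        {
          const double vkp = V[k][p], vkq = V[k][q];
          V[k][p] = c * vkp - s * vkq;
          V[k][q] = s * vkp + c * vkq;
        }
      }
  }

  // Sort eigenvalues ascending and permute eigenvector columns to match.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t(0));
  std::sort(order.begin(), order.end(),
            [&A](std::size_t a, std::size_t b) { return A[a][a] < A[b][b]; });

  EigResult r;
  r.eigval.resize(n);
  r.eigvec.assign(n, std::vector<double>(n, 0.0));
  for (std::size_t j = 0; j < n; ++j)
  {
    r.eigval[j] = A[order[j]][order[j]];
    for (std::size_t i = 0; i < n; ++i) { r.eigvec[i][j] = V[i][order[j]]; }
  }
  return r;
}

// Largest absolute entry of A*V - V*diag(eigval).
double max_residual(const Matrix& A, const EigResult& r)
{
  const std::size_t n = A.size();
  double worst = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
    {
      double av = 0.0;
      for (std::size_t k = 0; k < n; ++k) { av += A[i][k] * r.eigvec[k][j]; }
      worst = std::max(worst, std::abs(av - r.eigval[j] * r.eigvec[i][j]));
    }
  return worst;
}
```

A GPU result could be checked the same way by copying the eigenvalues and eigenvectors back to the host and calling `max_residual()` with a tolerance suited to the element type (looser for float than for double).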

Performance is okay. On my RTX 2080 Ti, there's effectively no speedup over the CPU for small matrices. (eig_sym_1 does not compute eigenvectors; eig_sym_2 does.) Each line below is: benchmark, device, backend, element type, rows, columns, trial, time.

eig_sym_1, rtx2080ti, cpu, float, 200, 200, 0, 0.00324448
eig_sym_1, rtx2080ti, cpu, float, 200, 200, 1, 0.0019189
eig_sym_1, rtx2080ti, cpu, float, 200, 200, 2, 0.00191259
eig_sym_1, rtx2080ti, cpu, float, 200, 200, 3, 0.00195748
eig_sym_1, rtx2080ti, cpu, float, 200, 200, 4, 0.00189927
eig_sym_1, rtx2080ti, opencl, float, 200, 200, 0, 0.18668
eig_sym_1, rtx2080ti, opencl, float, 200, 200, 1, 0.00221236
eig_sym_1, rtx2080ti, opencl, float, 200, 200, 2, 0.00211009
eig_sym_1, rtx2080ti, opencl, float, 200, 200, 3, 0.00211085
eig_sym_1, rtx2080ti, opencl, float, 200, 200, 4, 0.00213096
eig_sym_1, rtx2080ti, cuda, float, 200, 200, 0, 0.00229559
eig_sym_1, rtx2080ti, cuda, float, 200, 200, 1, 0.00206645
eig_sym_1, rtx2080ti, cuda, float, 200, 200, 2, 0.00426481
eig_sym_1, rtx2080ti, cuda, float, 200, 200, 3, 0.00205592
eig_sym_1, rtx2080ti, cuda, float, 200, 200, 4, 0.00204227
eig_sym_1, rtx2080ti, cpu, double, 200, 200, 0, 0.00274093
eig_sym_1, rtx2080ti, cpu, double, 200, 200, 1, 0.00256441
eig_sym_1, rtx2080ti, cpu, double, 200, 200, 2, 0.0025543
eig_sym_1, rtx2080ti, cpu, double, 200, 200, 3, 0.00259927
eig_sym_1, rtx2080ti, cpu, double, 200, 200, 4, 0.00258964
eig_sym_1, rtx2080ti, opencl, double, 200, 200, 0, 0.00287491
eig_sym_1, rtx2080ti, opencl, double, 200, 200, 1, 0.00345152
eig_sym_1, rtx2080ti, opencl, double, 200, 200, 2, 0.00388182
eig_sym_1, rtx2080ti, opencl, double, 200, 200, 3, 0.00346045
eig_sym_1, rtx2080ti, opencl, double, 200, 200, 4, 0.00282231
eig_sym_1, rtx2080ti, cuda, double, 200, 200, 0, 0.0110453
eig_sym_1, rtx2080ti, cuda, double, 200, 200, 1, 0.0084721
eig_sym_1, rtx2080ti, cuda, double, 200, 200, 2, 0.00931563
eig_sym_1, rtx2080ti, cuda, double, 200, 200, 3, 0.0102756
eig_sym_1, rtx2080ti, cuda, double, 200, 200, 4, 0.00969103
eig_sym_2, rtx2080ti, cpu, float, 200, 200, 0, 0.00694225
eig_sym_2, rtx2080ti, cpu, float, 200, 200, 1, 0.00397489
eig_sym_2, rtx2080ti, cpu, float, 200, 200, 2, 0.00567592
eig_sym_2, rtx2080ti, cpu, float, 200, 200, 3, 0.00415217
eig_sym_2, rtx2080ti, cpu, float, 200, 200, 4, 0.00410701
eig_sym_2, rtx2080ti, opencl, float, 200, 200, 0, 0.152741
eig_sym_2, rtx2080ti, opencl, float, 200, 200, 1, 0.00936481
eig_sym_2, rtx2080ti, opencl, float, 200, 200, 2, 0.00814352
eig_sym_2, rtx2080ti, opencl, float, 200, 200, 3, 0.00809262
eig_sym_2, rtx2080ti, opencl, float, 200, 200, 4, 0.00877366
eig_sym_2, rtx2080ti, cuda, float, 200, 200, 0, 0.00245916
eig_sym_2, rtx2080ti, cuda, float, 200, 200, 1, 0.00229966
eig_sym_2, rtx2080ti, cuda, float, 200, 200, 2, 0.00228561
eig_sym_2, rtx2080ti, cuda, float, 200, 200, 3, 0.00227887
eig_sym_2, rtx2080ti, cuda, float, 200, 200, 4, 0.00229172
eig_sym_2, rtx2080ti, cpu, double, 200, 200, 0, 0.00574728
eig_sym_2, rtx2080ti, cpu, double, 200, 200, 1, 0.00564246
eig_sym_2, rtx2080ti, cpu, double, 200, 200, 2, 0.00614105
eig_sym_2, rtx2080ti, cpu, double, 200, 200, 3, 0.00564683
eig_sym_2, rtx2080ti, cpu, double, 200, 200, 4, 0.00570821
eig_sym_2, rtx2080ti, opencl, double, 200, 200, 0, 0.0296498
eig_sym_2, rtx2080ti, opencl, double, 200, 200, 1, 0.0087113
eig_sym_2, rtx2080ti, opencl, double, 200, 200, 2, 0.00931823
eig_sym_2, rtx2080ti, opencl, double, 200, 200, 3, 0.00834185
eig_sym_2, rtx2080ti, opencl, double, 200, 200, 4, 0.00900794
eig_sym_2, rtx2080ti, cuda, double, 200, 200, 0, 0.0101469
eig_sym_2, rtx2080ti, cuda, double, 200, 200, 1, 0.011495
eig_sym_2, rtx2080ti, cuda, double, 200, 200, 2, 0.00899019
eig_sym_2, rtx2080ti, cuda, double, 200, 200, 3, 0.0124729
eig_sym_2, rtx2080ti, cuda, double, 200, 200, 4, 0.0101832

For larger matrices (here 4k x 4k), results are a bit better: CUDA outperforms the CPU by roughly 3x to 5x, and OpenCL also outperforms the CPU... but only for double precision, and only slightly.

eig_sym_1, rtx2080ti, cpu, float, 4000, 4000, 0, 0.732205
eig_sym_1, rtx2080ti, cpu, float, 4000, 4000, 1, 0.628252
eig_sym_1, rtx2080ti, cpu, float, 4000, 4000, 2, 0.697783
eig_sym_1, rtx2080ti, cpu, float, 4000, 4000, 3, 1.43967
eig_sym_1, rtx2080ti, cpu, float, 4000, 4000, 4, 0.813445
eig_sym_1, rtx2080ti, opencl, float, 4000, 4000, 0, 1.85979
eig_sym_1, rtx2080ti, opencl, float, 4000, 4000, 1, 1.18699
eig_sym_1, rtx2080ti, opencl, float, 4000, 4000, 2, 1.0774
eig_sym_1, rtx2080ti, opencl, float, 4000, 4000, 3, 1.30749
eig_sym_1, rtx2080ti, opencl, float, 4000, 4000, 4, 1.05786
eig_sym_1, rtx2080ti, cuda, float, 4000, 4000, 0, 0.258409
eig_sym_1, rtx2080ti, cuda, float, 4000, 4000, 1, 0.25926
eig_sym_1, rtx2080ti, cuda, float, 4000, 4000, 2, 0.258903
eig_sym_1, rtx2080ti, cuda, float, 4000, 4000, 3, 0.259689
eig_sym_1, rtx2080ti, cuda, float, 4000, 4000, 4, 0.260121
eig_sym_1, rtx2080ti, cpu, double, 4000, 4000, 0, 1.64451
eig_sym_1, rtx2080ti, cpu, double, 4000, 4000, 1, 1.69224
eig_sym_1, rtx2080ti, cpu, double, 4000, 4000, 2, 1.6123
eig_sym_1, rtx2080ti, cpu, double, 4000, 4000, 3, 1.77473
eig_sym_1, rtx2080ti, cpu, double, 4000, 4000, 4, 1.78902
eig_sym_1, rtx2080ti, opencl, double, 4000, 4000, 0, 1.9258
eig_sym_1, rtx2080ti, opencl, double, 4000, 4000, 1, 1.43119
eig_sym_1, rtx2080ti, opencl, double, 4000, 4000, 2, 1.46618
eig_sym_1, rtx2080ti, opencl, double, 4000, 4000, 3, 1.49345
eig_sym_1, rtx2080ti, opencl, double, 4000, 4000, 4, 1.59165
eig_sym_1, rtx2080ti, cuda, double, 4000, 4000, 0, 0.623701
eig_sym_1, rtx2080ti, cuda, double, 4000, 4000, 1, 0.623664
eig_sym_1, rtx2080ti, cuda, double, 4000, 4000, 2, 0.62343
eig_sym_1, rtx2080ti, cuda, double, 4000, 4000, 3, 0.62434
eig_sym_1, rtx2080ti, cuda, double, 4000, 4000, 4, 0.624847
eig_sym_2, rtx2080ti, cpu, float, 4000, 4000, 0, 1.23467
eig_sym_2, rtx2080ti, cpu, float, 4000, 4000, 1, 1.22311
eig_sym_2, rtx2080ti, cpu, float, 4000, 4000, 2, 1.36994
eig_sym_2, rtx2080ti, cpu, float, 4000, 4000, 3, 1.70646
eig_sym_2, rtx2080ti, cpu, float, 4000, 4000, 4, 1.54606
eig_sym_2, rtx2080ti, opencl, float, 4000, 4000, 0, 2.12245
eig_sym_2, rtx2080ti, opencl, float, 4000, 4000, 1, 1.46433
eig_sym_2, rtx2080ti, opencl, float, 4000, 4000, 2, 1.31715
eig_sym_2, rtx2080ti, opencl, float, 4000, 4000, 3, 1.46089
eig_sym_2, rtx2080ti, opencl, float, 4000, 4000, 4, 1.26818
eig_sym_2, rtx2080ti, cuda, float, 4000, 4000, 0, 0.297018
eig_sym_2, rtx2080ti, cuda, float, 4000, 4000, 1, 0.277646
eig_sym_2, rtx2080ti, cuda, float, 4000, 4000, 2, 0.279513
eig_sym_2, rtx2080ti, cuda, float, 4000, 4000, 3, 0.299881
eig_sym_2, rtx2080ti, cuda, float, 4000, 4000, 4, 0.281558
eig_sym_2, rtx2080ti, cpu, double, 4000, 4000, 0, 2.92829
eig_sym_2, rtx2080ti, cpu, double, 4000, 4000, 1, 2.77422
eig_sym_2, rtx2080ti, cpu, double, 4000, 4000, 2, 3.7488
eig_sym_2, rtx2080ti, cpu, double, 4000, 4000, 3, 3.15056
eig_sym_2, rtx2080ti, cpu, double, 4000, 4000, 4, 3.05637
eig_sym_2, rtx2080ti, opencl, double, 4000, 4000, 0, 2.52218
eig_sym_2, rtx2080ti, opencl, double, 4000, 4000, 1, 2.3988
eig_sym_2, rtx2080ti, opencl, double, 4000, 4000, 2, 2.55384
eig_sym_2, rtx2080ti, opencl, double, 4000, 4000, 3, 2.42044
eig_sym_2, rtx2080ti, opencl, double, 4000, 4000, 4, 2.45546
eig_sym_2, rtx2080ti, cuda, double, 4000, 4000, 0, 1.00889
eig_sym_2, rtx2080ti, cuda, double, 4000, 4000, 1, 1.04562
eig_sym_2, rtx2080ti, cuda, double, 4000, 4000, 2, 1.02579
eig_sym_2, rtx2080ti, cuda, double, 4000, 4000, 3, 1.02887
eig_sym_2, rtx2080ti, cuda, double, 4000, 4000, 4, 1.01021
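To turn raw lines like the ones above into a per-backend comparison, one option is to take the median time per configuration and divide the CPU median by the GPU median. The sketch below does that in plain C++. It skips trial 0 as a warm-up run, which is an assumption of this sketch (the 200 x 200 OpenCL float runs show trial 0 far slower than the rest), and the helper names `split_fields`, `median_times`, and `speedup_over_cpu` are hypothetical, not part of the bandicoot benchmark code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split one benchmark line on commas and trim surrounding whitespace.
std::vector<std::string> split_fields(const std::string& line)
{
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ','))
  {
    const std::size_t first = field.find_first_not_of(" \t");
    const std::size_t last = field.find_last_not_of(" \t");
    fields.push_back(first == std::string::npos
                         ? std::string()
                         : field.substr(first, last - first + 1));
  }
  return fields;
}

// Median time per "benchmark/backend/type/rows" key.
// Assumption: trial 0 is a warm-up run and is ignored.
std::map<std::string, double> median_times(const std::vector<std::string>& lines)
{
  std::map<std::string, std::vector<double>> samples;
  for (const std::string& line : lines)
  {
    const std::vector<std::string> f = split_fields(line);
    if (f.size() != 8) { continue; }  // ignore lines that are not benchmark rows
    if (f[6] == "0") { continue; }    // skip the assumed warm-up trial
    samples[f[0] + "/" + f[2] + "/" + f[3] + "/" + f[4]].push_back(std::stod(f[7]));
  }

  std::map<std::string, double> medians;
  for (auto& kv : samples)
  {
    std::vector<double>& v = kv.second;
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t m = v.size() / 2;
    medians[kv.first] = (v.size() % 2 == 1) ? v[m] : 0.5 * (v[m - 1] + v[m]);
  }
  return medians;
}

// CPU median divided by the given backend's median; values above 1 mean
// the backend is faster than the CPU for that configuration.
double speedup_over_cpu(const std::map<std::string, double>& medians,
                        const std::string& benchmark,
                        const std::string& backend,
                        const std::string& type,
                        const std::string& rows)
{
  const std::string suffix = "/" + type + "/" + rows;
  return medians.at(benchmark + "/cpu" + suffix)
       / medians.at(benchmark + "/" + backend + suffix);
}
```

For example, feeding in the eig_sym_1 float 4000 x 4000 lines for cpu and cuda and calling `speedup_over_cpu(medians, "eig_sym_1", "cuda", "float", "4000")` gives the CUDA speedup for that configuration. The median is used rather than the mean so that a single slow trial (such as the 1.43967 CPU run) does not dominate the comparison.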

I'm just going to go ahead and assume that on less powerful OpenCL devices (like my poor laptop) there won't be any speedup at all.

There is probably room for further optimization here, but at least for now, my priority is just getting it working at all.
