Add R-Bidiagonalization step to BDCSVD
What does this implement/fix?
This adds an R-Bidiagonalization step to BDCSVD. (mentioned in section 8.6.3 of the Golub and Van Loan "Matrix Computations" book.)
- When the input matrix
Ais sufficiently tall (or wide), this first computes the QR decomposition ofA(orA*). It then just runs the normal bidiagonalization/svd routines on the top rows of R. Finally, it has to apply Q on the left ofU(orV) to get the actual SVD ofA = Q(R/0) = Q(USV^T/0). - Optimization for tall matrices since less work is done in bidiagonalization.
As far as I can tell, LAPACK always does this in dgesdd when there's sufficient workspace. I'm not sure if adding additional workspace here is acceptable, but this mr shouldn't add much more. A lot of the new workspace needed for the QR decomposition should be cancelled out because UpperBidiagonalization uses less.
Looked like an easy win, so I thought I'd try it out. (There was also some recent discussion on Discord where people were using BDCSVD on super tall matrices, so I think this meets a real need.)
Additional information
I think it's working pretty well. Passes the test suite and I'm pleased with the speedup I'm seeing.
These are from some informal benchmarks:
- Run on an AMD EPYC 7742, with gcc-10.2.0, -O3 -march=core-avx2
- show results for computing both Thin U and V and just computhing Thin V, for double and complex.
Small-ish
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
bdcsvd_double_computeThinUV/38/38 -0.0150 -0.0146 560730 552343 558549 550367
bdcsvd_double_computeThinUV/38/138 -0.0311 -0.0317 665697 644985 663199 642201
bdcsvd_double_computeThinUV/38/238 -0.0851 -0.0852 783044 716412 780009 713560
bdcsvd_double_computeThinUV/38/338 -0.0897 -0.0905 884546 805231 881206 801484
bdcsvd_double_computeThinUV/38/438 -0.0891 -0.0896 969086 882702 965227 878719
bdcsvd_double_computeThinUV/38/538 -0.1708 -0.1712 1118555 927540 1114049 923328
bdcsvd_double_computeThinUV/38/638 -0.1533 -0.1534 1213303 1027309 1208212 1022824
bdcsvd_double_computeThinUV/38/738 -0.2031 -0.2033 1334411 1063345 1329151 1058981
bdcsvd_double_computeThinUV/38/838 -0.1994 -0.1998 1430272 1145130 1424248 1139691
bdcsvd_double_computeThinV/38/38 +0.0027 +0.0025 495862 497220 494012 495256
bdcsvd_double_computeThinV/38/138 +0.0156 +0.0153 526925 535143 524829 532842
bdcsvd_double_computeThinV/38/238 -0.0104 -0.0108 564038 558148 562120 556069
bdcsvd_double_computeThinV/38/338 -0.0942 -0.0950 654858 593161 652641 590671
bdcsvd_double_computeThinV/38/438 -0.1722 -0.1726 746768 618174 743960 615546
bdcsvd_double_computeThinV/38/538 -0.1859 -0.1867 780695 635544 777907 632657
bdcsvd_double_computeThinV/38/638 -0.2150 -0.2149 842719 661564 839380 658989
bdcsvd_double_computeThinV/38/738 -0.2572 -0.2572 926010 687850 922064 684915
bdcsvd_double_computeThinV/38/838 -0.2878 -0.2876 996698 709868 992685 707202
Larger
Benchmark Time CPU Time Old Time New CPU Old CPU New
-------------------------------------------------------------------------------------------------------------------------------------------
bdcsvd_double_computeThinUV/500/500 -0.0223 -0.0218 120934799 118241344 120206450 117587615
bdcsvd_double_computeThinUV/500/1000 -0.0781 -0.0768 168039073 154912570 166927595 154101147
bdcsvd_double_computeThinUV/500/2000 -0.0841 -0.0839 252251583 231030995 250837282 229786319
bdcsvd_double_computeThinUV/500/3000 -0.2933 -0.2931 354488660 250504511 352466612 249157515
bdcsvd_double_computeThinUV/500/4000 -0.3849 -0.3845 498082335 306358967 495251025 304808607
bdcsvd_double_computeThinUV/500/5000 -0.4102 -0.4101 611187632 360497410 607805405 358562461
bdcsvd_double_computeThinUV/500/6000 -0.5858 -0.5855 993030518 411281976 986559546 408897937
bdcsvd_double_computeThinV/500/500 +0.0023 +0.0025 96144720 96369521 95591914 95829350
bdcsvd_double_computeThinV/500/1000 -0.0108 -0.0104 125544195 124185888 124790609 123487358
bdcsvd_double_computeThinV/500/2000 +0.0516 +0.0515 169231449 177963941 168368397 177041849
bdcsvd_double_computeThinV/500/3000 -0.3211 -0.3211 220459022 149679402 219407709 148955709
bdcsvd_double_computeThinV/500/4000 -0.3434 -0.3435 292084731 191771479 290552164 190751061
bdcsvd_double_computeThinV/500/5000 -0.4782 -0.4776 404396172 211012044 401878571 209961360
bdcsvd_double_computeThinV/500/6000 -0.6294 -0.6291 598237028 221678815 594283542 220415778
bdcsvd_complex_double_computeThinUV/500/500 +0.0916 +0.0897 255520963 278933399 254162600 276962257
bdcsvd_complex_double_computeThinUV/500/1000 -0.0046 -0.0063 470148880 467963802 467394496 464450050
bdcsvd_complex_double_computeThinUV/500/2000 +0.0331 +0.0314 917793280 948137549 912521721 941157706
bdcsvd_complex_double_computeThinUV/500/3000 -0.2198 -0.2211 1360012279 1061090288 1351584140 1052804356
bdcsvd_complex_double_computeThinUV/500/4000 -0.3050 -0.3053 1915540819 1331249650 1903755375 1322459094
bdcsvd_complex_double_computeThinUV/500/5000 -0.2804 -0.2804 2421978472 1742944820 2405996094 1731417102
bdcsvd_complex_double_computeThinUV/500/6000 -0.3471 -0.3468 2871235454 1874670684 2851029992 1862315016
bdcsvd_complex_double_computeThinV/500/500 -0.0028 -0.0027 197499887 196955268 196285795 195757475
bdcsvd_complex_double_computeThinV/500/1000 -0.0236 -0.0235 318554734 311042543 316618729 309192952
bdcsvd_complex_double_computeThinV/500/2000 -0.0431 -0.0423 567077441 542655686 563321471 539468600
bdcsvd_complex_double_computeThinV/500/3000 -0.4819 -0.4812 915616107 474385083 909423431 471792729
bdcsvd_complex_double_computeThinV/500/4000 -0.4868 -0.4862 1144582829 587365450 1137211810 584304876
bdcsvd_complex_double_computeThinV/500/5000 -0.5118 -0.5114 1436150418 701125911 1427595270 697484811
bdcsvd_complex_double_computeThinV/500/6000 -0.5330 -0.5328 1733322684 809429173 1722951672 805023436
OVERALL_GEOMEAN -0.2741 -0.2740 0 0 0 0
Very Tall
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
bdcsvd_double_computeThinUV/32/100000 -0.3170 -0.3175 226529586 154717160 225472766 153888702
bdcsvd_double_computeThinUV/264/100000 -0.6096 -0.6095 5149984761 2010317098 5119070612 1998970140
bdcsvd_double_computeThinUV/1024/100000 -0.6538 -0.6530 59780980101 20698847700 59313139187 20578710233
bdcsvd_double_computeThinV/32/100000 -0.2606 -0.2609 258735994 191316443 257246119 190123143
bdcsvd_double_computeThinV/264/100000 -0.6068 -0.6065 5527338202 2173494907 5493199942 2161352169
bdcsvd_double_computeThinV/1024/100000 -0.6419 -0.6411 60740516522 21750376091 60278909057 21633052616
bdcsvd_complex_double_computeThinUV/32/100000 -0.2993 -0.2989 636792072 446202903 633206937 443951859
bdcsvd_complex_double_computeThinUV/264/100000 -0.4889 -0.4878 18326362603 9366890170 18194148881 9318136331
bdcsvd_complex_double_computeThinUV/1024/100000 -0.4089 -0.4083 195544107745 115591737930 194190302999 114900596242
bdcsvd_complex_double_computeThinV/32/100000 -0.4411 -0.4417 368514577 205950727 366569465 204654500
bdcsvd_complex_double_computeThinV/264/100000 -0.6587 -0.6583 12379953027 4225803248 12292426264 4200267389
bdcsvd_complex_double_computeThinV/1024/100000 -0.6676 -0.6669 125078044965 41579651579 124086840510 41338883708
OVERALL_GEOMEAN -0.5259 -0.5255 7 3 7 3