Implement transposes

Ryan Curtin requested to merge rcurtin/bandicoot-code:trans into unstable

This MR provides a simple out-of-place (non-inplace) implementation of transposes. cuBLAS supplies some primitives for transposing float/double matrices, so I used those; everywhere else I used trivial kernels. Performance seems to be about the same between cuBLAS, the CUDA kernels, and the OpenCL kernels I wrote.
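For reference, the index mapping that a trivial out-of-place transpose performs can be sketched on the CPU as below. This is only an illustration, not the code in this MR: `transpose_ref` is a hypothetical name, and the sketch assumes column-major storage (as in Armadillo). A GPU kernel would do the same copy with one thread per element, and for float/double a cuBLAS routine (presumably from the `geam` family) can take the place of the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of bandicoot).
// Out-of-place transpose of an n_rows x n_cols column-major matrix.
// Element (r, c) of the input lives at in[r + c * n_rows]; it is written to
// position (c, r) of the n_cols x n_rows output, i.e. out[c + r * n_cols].
template<typename eT>
std::vector<eT> transpose_ref(const std::vector<eT>& in,
                              const std::size_t n_rows,
                              const std::size_t n_cols)
{
  // The buffer must hold exactly n_rows * n_cols elements.
  assert(in.size() == n_rows * n_cols);

  std::vector<eT> out(in.size());
  for (std::size_t c = 0; c < n_cols; ++c)
  {
    for (std::size_t r = 0; r < n_rows; ++r)
    {
      out[c + r * n_cols] = in[r + c * n_rows];
    }
  }
  return out;
}
```

Because input and output are separate buffers, no element is overwritten before it is read, which is what keeps the non-inplace version simple.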

I did implement op_strans and op_htrans, even though we do not currently support any complex elements, so the operations end up being identical. When the time comes to add complex support, it will be a little easier with the infrastructure I set up here.
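To illustrate why the two operations coincide for now, here is a minimal CPU sketch of a simple transpose and a Hermitian (conjugate) transpose. The names `strans_ref`, `htrans_ref`, and `conj_elem` are hypothetical and are not the `op_strans`/`op_htrans` code in this MR; column-major storage is assumed. For real element types the conjugation is a no-op, so both functions return the same result; they only differ once the element type is complex.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Conjugation helper: identity for real element types...
template<typename eT>
eT conj_elem(const eT& x) { return x; }

// ...and the complex conjugate for std::complex element types.
template<typename T>
std::complex<T> conj_elem(const std::complex<T>& x) { return std::conj(x); }

// Simple transpose: out(c, r) = in(r, c), column-major storage assumed.
template<typename eT>
std::vector<eT> strans_ref(const std::vector<eT>& in,
                           const std::size_t n_rows,
                           const std::size_t n_cols)
{
  assert(in.size() == n_rows * n_cols);
  std::vector<eT> out(in.size());
  for (std::size_t c = 0; c < n_cols; ++c)
    for (std::size_t r = 0; r < n_rows; ++r)
      out[c + r * n_cols] = in[r + c * n_rows];
  return out;
}

// Hermitian transpose: out(c, r) = conj(in(r, c)).
template<typename eT>
std::vector<eT> htrans_ref(const std::vector<eT>& in,
                           const std::size_t n_rows,
                           const std::size_t n_cols)
{
  assert(in.size() == n_rows * n_cols);
  std::vector<eT> out(in.size());
  for (std::size_t c = 0; c < n_cols; ++c)
    for (std::size_t r = 0; r < n_rows; ++r)
      out[c + r * n_cols] = conj_elem(in[r + c * n_rows]);
  return out;
}
```

Keeping the two entry points separate means that adding complex support later only requires changing the conjugating path, not the callers.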

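Here are some simple benchmark results on transposing matrices of size 5k x 5k with different element types. The columns are: operation, device, backend, element type, rows, columns, trial index, and measured time.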

trans, rtx2080ti, cpu, float, 5000, 5000, 0, 0.0410803
trans, rtx2080ti, cpu, float, 5000, 5000, 1, 0.0407775
trans, rtx2080ti, cpu, float, 5000, 5000, 2, 0.0411488
trans, rtx2080ti, cpu, float, 5000, 5000, 3, 0.041334
trans, rtx2080ti, cpu, float, 5000, 5000, 4, 0.0409807
trans, rtx2080ti, opencl, float, 5000, 5000, 0, 0.001354
trans, rtx2080ti, opencl, float, 5000, 5000, 1, 0.00134989
trans, rtx2080ti, opencl, float, 5000, 5000, 2, 0.0013483
trans, rtx2080ti, opencl, float, 5000, 5000, 3, 0.00137172
trans, rtx2080ti, opencl, float, 5000, 5000, 4, 0.00135854
trans, rtx2080ti, cuda, float, 5000, 5000, 0, 0.000564771
trans, rtx2080ti, cuda, float, 5000, 5000, 1, 0.000527786
trans, rtx2080ti, cuda, float, 5000, 5000, 2, 0.000554309
trans, rtx2080ti, cuda, float, 5000, 5000, 3, 0.00154241
trans, rtx2080ti, cuda, float, 5000, 5000, 4, 0.00176127
trans, rtx2080ti, cpu, double, 5000, 5000, 0, 0.0672843
trans, rtx2080ti, cpu, double, 5000, 5000, 1, 0.0674348
trans, rtx2080ti, cpu, double, 5000, 5000, 2, 0.0674344
trans, rtx2080ti, cpu, double, 5000, 5000, 3, 0.0674752
trans, rtx2080ti, cpu, double, 5000, 5000, 4, 0.0674165
trans, rtx2080ti, opencl, double, 5000, 5000, 0, 0.0013842
trans, rtx2080ti, opencl, double, 5000, 5000, 1, 0.00138828
trans, rtx2080ti, opencl, double, 5000, 5000, 2, 0.001442
trans, rtx2080ti, opencl, double, 5000, 5000, 3, 0.00245004
trans, rtx2080ti, opencl, double, 5000, 5000, 4, 0.00139177
trans, rtx2080ti, cuda, double, 5000, 5000, 0, 0.0010774
trans, rtx2080ti, cuda, double, 5000, 5000, 1, 0.00227856
trans, rtx2080ti, cuda, double, 5000, 5000, 2, 0.0010726
trans, rtx2080ti, cuda, double, 5000, 5000, 3, 0.00107899
trans, rtx2080ti, cuda, double, 5000, 5000, 4, 0.00107525
trans, rtx2080ti, cpu, u32, 5000, 5000, 0, 0.0432899
trans, rtx2080ti, cpu, u32, 5000, 5000, 1, 0.0431485
trans, rtx2080ti, cpu, u32, 5000, 5000, 2, 0.0429224
trans, rtx2080ti, cpu, u32, 5000, 5000, 3, 0.0429015
trans, rtx2080ti, cpu, u32, 5000, 5000, 4, 0.0435896
trans, rtx2080ti, opencl, u32, 5000, 5000, 0, 0.00100086
trans, rtx2080ti, opencl, u32, 5000, 5000, 1, 0.000993269
trans, rtx2080ti, opencl, u32, 5000, 5000, 2, 0.000992364
trans, rtx2080ti, opencl, u32, 5000, 5000, 3, 0.00101455
trans, rtx2080ti, opencl, u32, 5000, 5000, 4, 0.00207038
trans, rtx2080ti, cuda, u32, 5000, 5000, 0, 0.00092557
trans, rtx2080ti, cuda, u32, 5000, 5000, 1, 0.000931778
trans, rtx2080ti, cuda, u32, 5000, 5000, 2, 0.000928857
trans, rtx2080ti, cuda, u32, 5000, 5000, 3, 0.000929476
trans, rtx2080ti, cuda, u32, 5000, 5000, 4, 0.000930546
trans, rtx2080ti, cpu, u64, 5000, 5000, 0, 0.0678621
trans, rtx2080ti, cpu, u64, 5000, 5000, 1, 0.0677879
trans, rtx2080ti, cpu, u64, 5000, 5000, 2, 0.0678003
trans, rtx2080ti, cpu, u64, 5000, 5000, 3, 0.0677589
trans, rtx2080ti, cpu, u64, 5000, 5000, 4, 0.0677823
trans, rtx2080ti, opencl, u64, 5000, 5000, 0, 0.00138334
trans, rtx2080ti, opencl, u64, 5000, 5000, 1, 0.0013905
trans, rtx2080ti, opencl, u64, 5000, 5000, 2, 0.00144542
trans, rtx2080ti, opencl, u64, 5000, 5000, 3, 0.00244962
trans, rtx2080ti, opencl, u64, 5000, 5000, 4, 0.00139335
trans, rtx2080ti, cuda, u64, 5000, 5000, 0, 0.0012638
trans, rtx2080ti, cuda, u64, 5000, 5000, 1, 0.00249188
trans, rtx2080ti, cuda, u64, 5000, 5000, 2, 0.00149606
trans, rtx2080ti, cuda, u64, 5000, 5000, 3, 0.00125933
trans, rtx2080ti, cuda, u64, 5000, 5000, 4, 0.00126158
trans, rtx2080ti, cpu, s32, 5000, 5000, 0, 0.0449638
trans, rtx2080ti, cpu, s32, 5000, 5000, 1, 0.0449194
trans, rtx2080ti, cpu, s32, 5000, 5000, 2, 0.0450336
trans, rtx2080ti, cpu, s32, 5000, 5000, 3, 0.0443406
trans, rtx2080ti, cpu, s32, 5000, 5000, 4, 0.0442285
trans, rtx2080ti, opencl, s32, 5000, 5000, 0, 0.00100998
trans, rtx2080ti, opencl, s32, 5000, 5000, 1, 0.00100126
trans, rtx2080ti, opencl, s32, 5000, 5000, 2, 0.000997774
trans, rtx2080ti, opencl, s32, 5000, 5000, 3, 0.000997256
trans, rtx2080ti, opencl, s32, 5000, 5000, 4, 0.00207912
trans, rtx2080ti, cuda, s32, 5000, 5000, 0, 0.000934726
trans, rtx2080ti, cuda, s32, 5000, 5000, 1, 0.00094195
trans, rtx2080ti, cuda, s32, 5000, 5000, 2, 0.000935265
trans, rtx2080ti, cuda, s32, 5000, 5000, 3, 0.00103965
trans, rtx2080ti, cuda, s32, 5000, 5000, 4, 0.00211693
trans, rtx2080ti, cpu, s64, 5000, 5000, 0, 0.0681599
trans, rtx2080ti, cpu, s64, 5000, 5000, 1, 0.0683435
trans, rtx2080ti, cpu, s64, 5000, 5000, 2, 0.0680057
trans, rtx2080ti, cpu, s64, 5000, 5000, 3, 0.067964
trans, rtx2080ti, cpu, s64, 5000, 5000, 4, 0.0680015
trans, rtx2080ti, opencl, s64, 5000, 5000, 0, 0.00218453
trans, rtx2080ti, opencl, s64, 5000, 5000, 1, 0.00139009
trans, rtx2080ti, opencl, s64, 5000, 5000, 2, 0.00148302
trans, rtx2080ti, opencl, s64, 5000, 5000, 3, 0.00199946
trans, rtx2080ti, opencl, s64, 5000, 5000, 4, 0.00137434
trans, rtx2080ti, cuda, s64, 5000, 5000, 0, 0.00125693
trans, rtx2080ti, cuda, s64, 5000, 5000, 1, 0.00248999
trans, rtx2080ti, cuda, s64, 5000, 5000, 2, 0.00126126
trans, rtx2080ti, cuda, s64, 5000, 5000, 3, 0.00126211
trans, rtx2080ti, cuda, s64, 5000, 5000, 4, 0.0012548

I'll merge this in a couple of days, and then return to the vectorise() MR.
