Subview optimizations, round one

When I originally wrote a lot of the Bandicoot kernels, I tried to avoid having too many kernel arguments, for fear that this would introduce too much overhead. Under this line of thinking, I wrote a number of kernels specific to subviews. However, after some experimentation, I recognized two things:

  • The number of arguments to a kernel does not appear to affect runtime in any noticeable way
  • GPU kernels are so memory-bound that the extra arithmetic involved in computing offsets is negligible

So, instead of having kernels of the form

__kernel void inplace_plus(__global float* mem,
                           const float val,
                           const uword n_elem)
{
  const uword index = get_global_id(0);
  if (index < n_elem)
    mem[index] += val;
}

it is equivalently fast to have

__kernel void inplace_plus(__global float* mem,
                           const float val,
                           const uword n_rows,
                           const uword n_cols,
                           const uword start_row,
                           const uword start_col,
                           const uword M_n_rows)
{
  const uword row = get_global_id(0);
  const uword col = get_global_id(1);
  if (row < n_rows && col < n_cols)
  {
    const uword index = (start_row + row) + (start_col + col) * M_n_rows;
    mem[index] += val;
  }
}

and this has the important benefit that all of these kernels can now operate on subviews directly. Previously, there were many situations inside Bandicoot where a subview would first be extracted into a temporary matrix before the operation could be performed, as in x += y.submat(...) + scalar and other similar elementwise operations.

Previously, running that x += y.submat(...) + scalar operation on a 1k x 1k matrix took (on my machine) the following amounts of time (average of 5 runs):

type, rows, cols, backend, avg. runtime (s)
 mat, 1000, 1000, cuda, 0.000373677
fmat, 1000, 1000, cuda, 0.000143279
imat, 1000, 1000, cuda, 0.000170638
 mat, 1000, 1000, opencl, 0.00017976
fmat, 1000, 1000, opencl, 0.000455382
imat, 1000, 1000, opencl, 0.000175067

and now with these changes:

type, rows, cols, backend, avg. runtime (s)
 mat, 1000, 1000, cuda, 2.0874e-06
fmat, 1000, 1000, cuda, 2.0698e-06
imat, 1000, 1000, cuda, 2.0054e-06
 mat, 1000, 1000, opencl, 6.0768e-06
fmat, 1000, 1000, opencl, 5.7244e-06
imat, 1000, 1000, opencl, 5.9594e-06

So, this MR then contains:

  • an adaptation of all oneway in-place kernels to twoway kernels with subview offset support
  • a simplification of related coot_rt_t backend functions
  • removal of now-unused subview-specific in-place kernels

I also tested other operations not involving subviews (to see the effect of more kernel arguments) and was not able to find any measurable difference in runtime.

This is the first of a handful of MRs that add direct support for operating on subviews without extraction.
