Implement conv_to()
Whew, this quickly turned into a slog. The scope was much more than I originally anticipated. In any case, this MR implements the `conv_to<>` operation, so now one can do things like this:
```cpp
Mat<float> x(10, 10);
x.fill(5);
Mat<double> y = conv_to<Mat<double>>::from(x);
```
This also works on arbitrary operations, so, e.g., I can do

```cpp
Mat<double> y = conv_to<Mat<double>>::from(x + x - 3);
```
This, then, solves #15 (closed).
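For intuition, the user-facing semantics can be sketched eagerly with standard containers. This is illustrative only---the MR's actual `conv_to` builds a delayed `op_conv_to` node (so the cast can be fused into a single kernel, as discussed below), and `std::vector` here is just a stand-in for `Mat`:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of conv_to's semantics using std::vector in place of
// Mat.  The destination type is the template parameter, and from() performs
// an eager element-wise cast (the real implementation is delayed instead).
template<typename OutContainer>
struct conv_to
  {
  template<typename InContainer>
  static OutContainer from(const InContainer& in)
    {
    // range construction converts each element to OutContainer's element type
    return OutContainer(in.begin(), in.end());
    }
  };
```

With this sketch, `conv_to<std::vector<double>>::from(x)` mirrors the shape of the `conv_to<Mat<double>>::from(x)` call above.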
Now, let's talk about why the scope creep happened...
Ideally, if I did something as simple as `Mat<double> y = conv_to<Mat<double>>::from(x)`, I'd like this to execute as one kernel, performing the cast at the same time as copying the elements of `x` to `y`. That's not too hard to implement in a standalone way. But I'd also like expressions like `y = conv_to<Mat<float>>::from(2 * x)` to use only one kernel! The ability to do this with only one call will affect the benchmarks I am starting to put together for #9 (closed), #10, #11, and #12. This turns out to be a little less trivial, and that desire drove me to overhaul quite a lot of the code. Here's a summary of the implications:
- All objects (`Base`, `Op`, `Glue`, `eOp`, `eGlue`) now take an additional template parameter `eT`. Specifically, for something like `Op<eT, T1, op_type>`, this means that `eT` can be different from `typename T1::elem_type`.
- `unwrap<>` now first checks to see if there are any `op_conv_to`s in the expression to be unwrapped, and if so, the code tries to "merge" operations. As an example, `Op<double, eOp<float, Mat<float>, op_square>, op_conv_to>` can be simplified to `eOp<double, Mat<float>, op_square>`.
- An implication of this is that we now need "multi-way" kernels that can take in an `eT1` and output an `eT2`, or similar. This causes an explosion in the number of kernels we compile (hence #18), but, honestly, this explosion was already on the way to happening anyway. Significant refactoring was done to the CUDA and OpenCL runtimes (including factoring out common bits into `rt_common/`) to make this work as efficiently as possible---which seems to mean gathering all of the kernels at once and then compiling them.
- A follow-up implication is that since there are many more kernels, the initial compilation of kernels at program startup takes longer. On my system, I've managed to get this down to 3-4 seconds, which is not unmanageable and isn't a show-stopper (for now). It's likely this will need to be revisited as time goes on (again, hence #18).
- We have to add special overloads to `Mat<>` and `subview<>` to catch operators where a conversion operation is passed in. This is so that we can call the correct kernel that takes the given input type and copies it over to the type of the destination `Mat` (or `subview`). It feels a little awkward to specifically reference `op_conv_to` in the `Mat` and `subview` classes, but I haven't seen any way around it. Perhaps a later thought will lend itself to a cleaned-up design?
- Allowing input and output types to differ for bandicoot objects means that the `mtGlue` and `mtOp` objects are superfluous---so I've removed the couple of references to them.
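To make the first two bullets concrete, here is a minimal compilable sketch of the new `eT` parameter and the merge that `unwrap<>` performs. The class names follow the list above, but the bodies are illustrative stand-ins, not bandicoot's real code:

```cpp
#include <cassert>
#include <type_traits>

// stand-ins for the real classes; only the template shapes matter here
struct op_square  { };
struct op_conv_to { };

template<typename eT> struct Mat { typedef eT elem_type; };

// eT is the node's *output* element type, which may now differ from
// typename T1::elem_type when a conversion is pending
template<typename eT, typename T1, typename eop_type> struct eOp { typedef eT elem_type; };
template<typename eT, typename T1, typename op_type>  struct Op  { typedef eT elem_type; };

// merge step: by default there is nothing to do
template<typename T> struct merge_conv { typedef T result; };

// Op<eT2, eOp<eT1, T1, eop_type>, op_conv_to>  ==>  eOp<eT2, T1, eop_type>
template<typename eT2, typename eT1, typename T1, typename eop_type>
struct merge_conv< Op< eT2, eOp<eT1, T1, eop_type>, op_conv_to > >
  { typedef eOp<eT2, T1, eop_type> result; };

// the example from the list: the conversion folds into the element-wise op
static_assert(std::is_same<
    merge_conv< Op< double, eOp<float, Mat<float>, op_square>, op_conv_to > >::result,
    eOp<double, Mat<float>, op_square> >::value,
    "op_conv_to merge failed");
```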
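The "multi-way" kernel idea from the third bullet can be sketched in plain C++ for readability (the real kernels are CUDA/OpenCL, instantiated for every `(eT1, eT2)` pair, which is exactly where the kernel-count explosion comes from):

```cpp
#include <cassert>
#include <cstddef>

// One fused pass: read eT1, apply the element-wise op, write eT2.
// Without a two-type kernel this would take two launches (convert, then
// square); here the cast rides along with the op in a single pass.
template<typename eT2, typename eT1>
void fused_square_convert(eT2* dest, const eT1* src, std::size_t n)
  {
  for (std::size_t i = 0; i < n; ++i)
    {
    const eT2 v = eT2(src[i]);  // eT1 -> eT2 cast
    dest[i] = v * v;            // element-wise op in the same pass
    }
  }
```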
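And the `Mat<>` overload from the fifth bullet, again as a simplified runnable stand-in: a CPU loop replaces the kernel launch, and `mem` and `make_conv` are hypothetical names used only for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct op_conv_to { };

// delayed node: eT is the output type requested by conv_to, T1 the input
template<typename eT, typename T1, typename op_type>
struct Op { const T1& q; explicit Op(const T1& in) : q(in) { } };

template<typename eT>
struct Mat
  {
  std::vector<eT> mem;  // hypothetical storage stand-in

  // the special overload: it sees both the destination type eT and the
  // source's element type (via T1), so it can pick the correct eT1 -> eT
  // kernel; the loop below stands in for that single fused kernel launch
  template<typename T1>
  Mat& operator=(const Op<eT, T1, op_conv_to>& x)
    {
    mem.resize(x.q.mem.size());
    for (std::size_t i = 0; i < mem.size(); ++i) { mem[i] = eT(x.q.mem[i]); }
    return *this;
    }
  };

// hypothetical helper standing in for conv_to<Mat<eT2>>::from(...)
template<typename eT2, typename eT1>
Op<eT2, Mat<eT1>, op_conv_to> make_conv(const Mat<eT1>& in)
  { return Op<eT2, Mat<eT1>, op_conv_to>(in); }
```

The awkwardness mentioned above is visible even in this sketch: `Mat` has to name `op_conv_to` directly in order to own the overload.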
Whew. I hope the next thing I do isn't so much work. This took lots and lots of evenings and weekends...