Implement conv_to()
Whew, this quickly turned into a slog. The scope was much more than I originally anticipated. In any case, this MR implements the `conv_to<>` operation, so now one can do things like this:
```cpp
Mat<float> x(10, 10);
x.fill(5);
Mat<double> y = conv_to<Mat<double>>::from(x);
```
This also works on arbitrary operations, so, e.g., I can do

```cpp
Mat<double> y = conv_to<Mat<double>>::from(x + x - 3);
```
This, then, solves #15 (closed).
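For intuition, the user-facing semantics can be sketched eagerly with standard containers. This is illustrative only---the MR's actual `conv_to` builds a delayed `op_conv_to` node (so the cast can be fused into a single kernel, as discussed below), and `std::vector` here is just a stand-in for `Mat`:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of conv_to's semantics using std::vector in place of
// Mat.  The destination type is the template parameter, and from() performs
// an eager element-wise cast (the real implementation is delayed instead).
template<typename OutContainer>
struct conv_to
  {
  template<typename InContainer>
  static OutContainer from(const InContainer& in)
    {
    // range construction converts each element to OutContainer's element type
    return OutContainer(in.begin(), in.end());
    }
  };
```

With this sketch, `conv_to<std::vector<double>>::from(x)` mirrors the shape of the `conv_to<Mat<double>>::from(x)` call above.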
Now, let's talk about why the scope creep happened...
Ideally, if I did something as simple as `Mat<double> y = conv_to<Mat<double>>::from(x)`, I'd like this to execute as one kernel, performing the cast at the same time as copying the elements of `x` to `y`. That's not too hard to implement in a standalone way. But I'd also like expressions like `y = conv_to<Mat<float>>::from(2 * x)` to use only one kernel! The ability to do this with only one call will affect the benchmarks I am starting to put together for #9 (closed), #10, #11, and #12. This turns out to be a little less trivial, and that desire drove me to overhaul quite a lot of the code. Here's a summary of the implications:
- All objects (`Base`, `Op`, `Glue`, `eOp`, `eGlue`) now take an additional template parameter `eT`. Specifically, for something like `Op<eT, T1, op_type>`, this means that `eT` can be different from `typename T1::elem_type`.
- `unwrap<>` now first checks to see if there are any `op_conv_to`s in the expression to be unwrapped, and if so, the code tries to "merge" operations. As an example, `Op<double, eOp<float, Mat<float>, op_square>, op_conv_to>` can be simplified to `eOp<double, Mat<float>, op_square>`.
- An implication of this is that we now need "multi-way" kernels that can take in an `eT1` and output an `eT2`, or similar. This causes an explosion in the number of kernels we compile (hence #18), but, honestly, this explosion was already on the way to happening anyway. Significant refactoring was done to the CUDA and OpenCL runtimes (including factoring out common bits into `rt_common/`) to make this work as efficiently as possible---which seems to mean gathering all of the kernels at once and then compiling them.
- A follow-up implication is that since there are many more kernels, the initial compilation of kernels at program startup takes longer. On my system, I've managed to get this down to 3-4 seconds, which is not unmanageable and isn't a show-stopper (for now). It's likely this will need to be revisited as time goes on (again, hence #18).
- We have to add special overloads to `Mat<>` and `subview<>` to catch operators where a conversion operation is passed in. This is so that we can call the correct kernel that takes the given input type and copies it over to the type of the destination `Mat` (or `subview`). It feels a little awkward to specifically reference `op_conv_to` in the `Mat` and `subview` classes, but I haven't seen any way around it. Perhaps a later thought will lend itself to a cleaned-up design?
- Allowing input and output types to differ for bandicoot objects means that the `mtGlue` and `mtOp` objects are superfluous---so I've removed the couple of references to them.
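To make the first two bullets concrete, here is a minimal compilable sketch of the new `eT` parameter and the merge that `unwrap<>` performs. The class names follow the list above, but the bodies are illustrative stand-ins, not bandicoot's real code:

```cpp
#include <cassert>
#include <type_traits>

// stand-ins for the real classes; only the template shapes matter here
struct op_square  { };
struct op_conv_to { };

template<typename eT> struct Mat { typedef eT elem_type; };

// eT is the node's *output* element type, which may now differ from
// typename T1::elem_type when a conversion is pending
template<typename eT, typename T1, typename eop_type> struct eOp { typedef eT elem_type; };
template<typename eT, typename T1, typename op_type>  struct Op  { typedef eT elem_type; };

// merge step: by default there is nothing to do
template<typename T> struct merge_conv { typedef T result; };

// Op<eT2, eOp<eT1, T1, eop_type>, op_conv_to>  ==>  eOp<eT2, T1, eop_type>
template<typename eT2, typename eT1, typename T1, typename eop_type>
struct merge_conv< Op< eT2, eOp<eT1, T1, eop_type>, op_conv_to > >
  { typedef eOp<eT2, T1, eop_type> result; };

// the example from the list: the conversion folds into the element-wise op
static_assert(std::is_same<
    merge_conv< Op< double, eOp<float, Mat<float>, op_square>, op_conv_to > >::result,
    eOp<double, Mat<float>, op_square> >::value,
    "op_conv_to merge failed");
```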
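The "multi-way" kernel idea from the third bullet can be sketched in plain C++ for readability (the real kernels are CUDA/OpenCL, instantiated for every `(eT1, eT2)` pair, which is exactly where the kernel-count explosion comes from):

```cpp
#include <cassert>
#include <cstddef>

// One fused pass: read eT1, apply the element-wise op, write eT2.
// Without a two-type kernel this would take two launches (convert, then
// square); here the cast rides along with the op in a single pass.
template<typename eT2, typename eT1>
void fused_square_convert(eT2* dest, const eT1* src, std::size_t n)
  {
  for (std::size_t i = 0; i < n; ++i)
    {
    const eT2 v = eT2(src[i]);  // eT1 -> eT2 cast
    dest[i] = v * v;            // element-wise op in the same pass
    }
  }
```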
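And the `Mat<>` overload from the fifth bullet, again as a simplified runnable stand-in: a CPU loop replaces the kernel launch, and `mem` and `make_conv` are hypothetical names used only for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct op_conv_to { };

// delayed node: eT is the output type requested by conv_to, T1 the input
template<typename eT, typename T1, typename op_type>
struct Op { const T1& q; explicit Op(const T1& in) : q(in) { } };

template<typename eT>
struct Mat
  {
  std::vector<eT> mem;  // hypothetical storage stand-in

  // the special overload: it sees both the destination type eT and the
  // source's element type (via T1), so it can pick the correct eT1 -> eT
  // kernel; the loop below stands in for that single fused kernel launch
  template<typename T1>
  Mat& operator=(const Op<eT, T1, op_conv_to>& x)
    {
    mem.resize(x.q.mem.size());
    for (std::size_t i = 0; i < mem.size(); ++i) { mem[i] = eT(x.q.mem[i]); }
    return *this;
    }
  };

// hypothetical helper standing in for conv_to<Mat<eT2>>::from(...)
template<typename eT2, typename eT1>
Op<eT2, Mat<eT1>, op_conv_to> make_conv(const Mat<eT1>& in)
  { return Op<eT2, Mat<eT1>, op_conv_to>(in); }
```

The awkwardness mentioned above is visible even in this sketch: `Mat` has to name `op_conv_to` directly in order to own the overload.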
Whew. I hope the next thing I do isn't so much work. This took lots and lots of evenings and weekends...