On-demand kernels
The work on `conv_to<>` with #15 (closed) means that we'll end up with a bunch of kernels for each type. Even for our existing set of kernels (before I open an MR for that), using NVRTC to compile all of our CUDA kernels means a ~0.5-1s lag every time we start Bandicoot, since we are compiling kernels for each type that we support.

As the number of kernels we support expands, and as `conv_to<>` is implemented, we may find that we have quite a lot of kernels very quickly. For instance, "embedding" `conv_to<>` into kernels like this becomes necessary for fast code:
```cpp
// effectively equivalent to: out = conv_to<Mat<eT2>>::from(in)
__global__ void op(eT2* out, eT1* in)
  {
  ...
  }
```
and there are even places where we might need "three-way" kernels too. Very quickly this becomes a huge startup burden, especially since many of these kernels aren't being used at all. This problem seems somewhat unavoidable; even the ~0.5s startup lag we currently have is undesirable in my opinion.
I played around with the number of kernels being compiled by NVRTC. My branch is in a state where doing the same for OpenCL would be... tricky, but those numbers are likely about the same; I'll get them later. (Although, if OpenCL compilation is super fast for some reason, then I guess we don't need to solve the problem for that backend.)
- If I compile only one CUDA kernel with NVRTC, it takes roughly 0.1s.
- If I compile approximately 100 CUDA kernels in one shot with NVRTC, it takes roughly 0.25s.
- If I compile approximately 1000 CUDA kernels in one shot with NVRTC, it takes roughly 2.5s.
- If I compile a bit less than 2000 CUDA kernels in one shot with NVRTC, it takes roughly 4.1s.
So I see a few ways to approach this:
- Compile each kernel on-demand; this is easy to do via the infrastructure we already have. Basically, when we fetch a kernel (a `CUfunction` or `cl_kernel`), we check whether we've already compiled it; if not, we call out to NVRTC or OpenCL to build it, and then return the compiled kernel. Just as a quick off-the-wall data point, the time to start bandicoot when I have it compile only one CUDA kernel via NVRTC is 0.1s, which isn't unreasonable. I would expect most users to be working with only a handful of kernels.

- Compile some "sets" of kernels on-demand; i.e., when a user calls `accu()` with `eT = float`, compile all `float` kernels. That can help keep the ~0.1s overhead of starting NVRTC from hurting too much. Or, whenever a user makes a `Mat<eT>`, make sure all kernels are compiled for `eT`, etc.

- See if we have a way to "cache" compiled kernels on disk. This is of course only relevant if the device being used is the exact same each run---which may not be the case. Support like that would then allow us to, e.g., compile all possible kernels the first time that bandicoot is run (via some user-specified flag); after that, there is no more kernel compilation overhead.
For now, I'm going to do my best to make things fast in the non-on-demand framework that we have; but there will be a lot of kernels, so startup will be slow. We can then revisit this problem, orthogonally to other changes, via any of the three approaches above.