On-demand kernels
The work on `conv_to<>` with #15 (closed) means that we'll end up with a bunch of kernels for each type. Even for our existing set of kernels (before I open an MR for that), using NVRTC to compile all of our CUDA kernels means a ~0.5-1s lag every time we start Bandicoot, since we are compiling kernels for each type that we support.

As the number of kernels we support expands, and as `conv_to<>` is implemented, we may find that we have quite a lot of kernels very quickly. For instance, "embedding" `conv_to<>` into kernels like this becomes necessary for fast code:
```cpp
// effectively equivalent to: out = conv_to<Mat<eT2>>::from(in)
__global__ void op(eT2* out, eT1* in)
  {
  ...
  }
```
and there are even places where we might need "three-way" kernels too. Very quickly this becomes a huge startup burden, especially since many of these kernels aren't being used at all. This problem seems somewhat unavoidable; even the ~0.5s startup lag we currently have is undesirable in my opinion.
I played around with the number of kernels being compiled by NVRTC. My branch is in a state where doing the same for OpenCL would be... tricky, but those numbers are likely about the same; I'll get them later. (Although, if OpenCL compilation is super fast for some reason, then I guess we don't need to solve the problem for that backend.)
- If I compile only one CUDA kernel with NVRTC, it takes roughly 0.1s.
- If I compile approximately 100 CUDA kernels in one shot with NVRTC, it takes roughly 0.25s.
- If I compile approximately 1000 CUDA kernels in one shot with NVRTC, it takes roughly 2.5s.
- If I compile a bit less than 2000 CUDA kernels in one shot with NVRTC, it takes roughly 4.1s.
So I see a few ways to approach this:
- Compile each kernel on-demand; this is easy to do via the infrastructure we already have. Basically, when we fetch a kernel (a `CUfunction` or `cl_kernel`), we check whether we've already compiled it; if not, we call out to NVRTC or OpenCL to build it, and then return the compiled kernel. Just as a quick off-the-wall data point, the time to start bandicoot when I have it compile only one CUDA kernel via NVRTC is 0.1s, which isn't unreasonable. I would expect most users to be working with only a handful of kernels.

- Compile some "sets" of kernels on-demand; i.e., when a user calls `accu()` with `eT = float`, compile all `float` kernels. That can help keep the ~0.1s overhead of starting NVRTC from hurting too much. Or, whenever a user makes a `Mat<eT>`, make sure all kernels are compiled for `eT`, etc.

- See if we have a way to "cache" compiled kernels on disk. This is of course only relevant if the device being used is the exact same each run---which may not be the case. Support like that would then allow us to, e.g., compile all possible kernels the first time that bandicoot is run (via some user-specified flag); after that, there is no more kernel compilation overhead.
For now, I'm going to do my best to make things fast in the non-on-demand framework that we have; but there will be a lot of kernels, so startup will be slow. We can then revisit this problem, orthogonally to other changes, via any of the three approaches above.