On-demand kernel compilation
This MR reworks the compilation infrastructure so that kernels are compiled only when they're actually used. As a result, there is no longer a huge delay the first time that Bandicoot is run. It also resolves the issue where adding new kernels or, worse, new types significantly increases the first-time compilation cost. It thus opens the way for two further lines of development:
- adding all kinds of new types (u8/s8/u16/s16/fp16/bf16/other types)
- generating kernels for compilation based on expressions
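To illustrate the on-demand idea described above, here is a minimal sketch of a lazy kernel cache. It is not the code in this MR: `compiled_kernel`, `lazy_kernel_cache`, and the `compile_fn` callback are hypothetical names, and the callback stands in for whatever actually invokes the OpenCL or CUDA compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled kernel handle (e.g. a cl_kernel).
struct compiled_kernel
  {
  std::string name;
  };

// Hypothetical lazy cache: each kernel is compiled the first time it is requested.
class lazy_kernel_cache
  {
  public:

  using compile_fn = std::function<compiled_kernel(const std::string&)>;

  explicit lazy_kernel_cache(compile_fn fn)
    : compile(std::move(fn))
    {
    assert(static_cast<bool>(compile));  // a compile callback is required
    }

  // Return the kernel with the given name, compiling it only if needed.
  const compiled_kernel& get(const std::string& name)
    {
    auto it = kernels.find(name);

    if (it == kernels.end())
      {
      // First use: pay the compilation cost for this one kernel only.
      it = kernels.emplace(name, compile(name)).first;
      ++n_compiled;
      }

    return it->second;
    }

  // Number of kernels that have actually been compiled so far.
  std::size_t compiled_count() const { return n_compiled; }

  private:

  compile_fn compile;
  std::unordered_map<std::string, compiled_kernel> kernels;
  std::size_t n_compiled = 0;
  };
```

With a structure like this, adding more kernels or element types only adds entries that may never be requested, so the first-run cost depends on the kernels a program uses rather than on how many exist.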
In my timing simulations, I notice no measurable difference in the time to run bandicoot_test regardless of whether cached kernels are available. So, compiling all the kernels at once vs. one at a time as they're needed makes no real difference, and loading cached kernels all at once vs. one at a time also makes no real difference. I was surprised by that result, but I'm happy with it.
I was able to simplify some code relating to substituting types into kernels, since I can now just use macros and the OpenCL compiler for that. I also replaced the #pragma unroll extension checks in the OpenCL kernels, since they would cause warnings; OpenCL 2.0+ has a built-in attribute that does the same thing. Some other warnings relating to the min/max values of types were also fixed.
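As a rough sketch of the macro-based type substitution mentioned above: the kernel source can be written once against a macro name, and the concrete type supplied through a `-D` option in the options string passed to `clBuildProgram()`. The macro name `eT`, the `fill` kernel, and both helper functions are assumptions for illustration, not the names used in this MR. The loop annotation uses `__attribute__((opencl_unroll_hint))` from OpenCL C 2.0, which is presumably the built-in attribute referred to above.

```cpp
#include <cassert>
#include <string>

// Hypothetical kernel source, written once against the macro name `eT`.
// The loop carries the OpenCL C 2.0 unroll attribute instead of a pragma.
inline const char* fill_kernel_source()
  {
  return
    "__kernel void fill(__global eT* out, const eT val, const uint n)\n"
    "  {\n"
    "  __attribute__((opencl_unroll_hint))\n"
    "  for (uint i = get_global_id(0); i < n; i += get_global_size(0))\n"
    "    {\n"
    "    out[i] = val;\n"
    "    }\n"
    "  }\n";
  }

// Build the options string that would be passed to clBuildProgram(),
// letting the OpenCL compiler's preprocessor substitute the element type.
inline std::string build_options_for(const std::string& opencl_type)
  {
  assert(!opencl_type.empty());
  return "-D eT=" + opencl_type;
  }
```

For example, `build_options_for("float")` yields `-D eT=float`, so the same source string compiles to a `float` kernel without any string substitution on the host side.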