On-demand kernel compilation

This MR reworks the compilation infrastructure so that kernels are compiled only when they are actually used. As a result, there is no longer a large delay the first time Bandicoot is run. It also resolves the issue where adding new kernels, or worse, new types, significantly increases the first-time compilation cost. This opens the way for two further lines of development:

  • adding all kinds of new types (u8/s8/u16/s16/fp16/bf16/other types)
  • generating kernels for compilation based on expressions

In my timing simulations, I see no measurable difference in the time to run bandicoot_test regardless of whether cached kernels are available. In other words, compiling all the kernels at once vs. one at a time as they are needed makes no real difference, and loading cached kernels all at once vs. one at a time also makes no real difference. I was surprised by that result, but I'm happy with it.
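To illustrate the compile-on-first-use idea, here is a minimal sketch and not the code in this MR. The names `compiled_kernel`, `lazy_kernel_cache`, and `compiled_count()` are hypothetical, and the compile step is passed in as a callback so the sketch does not depend on OpenCL or CUDA being available.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled kernel handle (e.g. a cl_kernel).
struct compiled_kernel
  {
  std::string name;
  };

// Hypothetical lazy cache: a kernel is compiled the first time it is requested,
// and every later request for the same name reuses the stored result.
class lazy_kernel_cache
  {
  public:

  using compile_fn = std::function<compiled_kernel(const std::string&)>;

  explicit lazy_kernel_cache(compile_fn fn)
    : compile(std::move(fn))
    {
    assert(static_cast<bool>(compile));  // a compile callback is required
    }

  const compiled_kernel& get(const std::string& name)
    {
    auto it = kernels.find(name);

    if (it == kernels.end())
      {
      // First use: pay the compilation cost for this kernel only.
      it = kernels.emplace(name, compile(name)).first;
      ++n_compiled;
      }

    return it->second;
    }

  // Number of kernels compiled so far (not the number of lookups).
  std::size_t compiled_count() const { return n_compiled; }

  private:

  compile_fn compile;
  std::unordered_map<std::string, compiled_kernel> kernels;
  std::size_t n_compiled = 0;
  };
```

With a cache like this, requesting the same kernel name twice triggers one compilation, and kernels that are never requested are never compiled, so adding kernels or element types does not by itself add to startup cost.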

I was able to simplify some of the code that substitutes types into kernels, since I can now just use macros and let the OpenCL compiler do the substitution. I also replaced the #pragma unroll extension checks in the OpenCL kernels, since they caused warnings; OpenCL 2.0+ has a built-in attribute that does the same thing. Some other warnings relating to the min/max values of types were also fixed.
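As a rough sketch of the macro-based approach (not the code in this MR), the element type can be passed to the OpenCL compiler as a `-D` definition in the build options string given to `clBuildProgram`. The macro name `ET`, the helper `make_build_options`, and the `scale4` kernel are all hypothetical. I am assuming the built-in attribute referred to above is `__attribute__((opencl_unroll_hint(n)))` from OpenCL C 2.0.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the options string that would be passed to
// clBuildProgram(). "ET" is an illustrative macro name for the element type.
inline std::string make_build_options(const std::string& elem_type,
                                      const std::string& cl_std = "CL2.0")
  {
  assert(!elem_type.empty());  // the kernel cannot compile without a type
  return "-cl-std=" + cl_std + " -D ET=" + elem_type;
  }

// Hypothetical kernel source. ET is replaced by the OpenCL preprocessor, so the
// host code does not need to do any string substitution of its own.
static const char* const scale4_kernel_source = R"CLC(
__kernel void scale4(__global ET* out, const ET val)
  {
  const uint base = get_global_id(0) * 4;

  // OpenCL C 2.0 unroll hint, used here in place of "#pragma unroll".
  __attribute__((opencl_unroll_hint(4)))
  for (uint j = 0; j < 4; ++j)
    {
    out[base + j] *= val;
    }
  }
)CLC";
```

For example, `make_build_options("float")` yields `-cl-std=CL2.0 -D ET=float`, so the same source string can be compiled once per element type when that type is first needed.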
