Reorganize CI to test on more GPUs and CUDA versions
My goal is to test:
- with different CUDA toolkits
- with and without C++23 enabled (fp16 is defined differently there, and other behavior may differ too)
- with OpenCL on non-NVIDIA GPUs (preferably AMD/Intel)
We will see whether I can cover all of these without increasing the build time too much.
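
To get a rough sense of the build-time trade-off, here is a minimal sketch that compares a full job matrix against a "baseline plus one change at a time" matrix. The axis names and values are hypothetical placeholders, not the actual CI configuration, and the pruning strategy is only one possible option.

```python
from itertools import product

# Hypothetical axes: labels are placeholders, not real CI settings.
AXES = {
    "platform": ["cuda-oldest", "cuda-newest", "opencl-amd", "opencl-intel"],
    "cxx23": [False, True],
}


def full_matrix(axes):
    """Every combination of every axis value (Cartesian product)."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*axes.values())]


def one_factor_matrix(axes):
    """A baseline job, plus one job per non-baseline value of each axis.

    Each value is exercised at least once, but interactions between
    axes are only covered against the baseline.
    """
    baseline = {key: values[0] for key, values in axes.items()}
    jobs = [baseline]
    for key, values in axes.items():
        for value in values[1:]:
            jobs.append({**baseline, key: value})
    return jobs


if __name__ == "__main__":
    print(len(full_matrix(AXES)), "jobs in the full matrix")
    print(len(one_factor_matrix(AXES)), "jobs in the reduced matrix")
```

With these placeholder axes the full matrix has 8 jobs and the reduced one has 5; the gap grows quickly as more axes or values are added.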