M1 Mac OpenCL fixes

This is a whole pile of fixes for M1 Macs and other OpenCL implementations. Bandicoot's CUDA kernels do a lot of inter-warp synchronization, especially in the sorting kernels. OpenCL has no equivalent: work-items can synchronize within a single workgroup, but inter-workgroup synchronization is not possible inside a kernel. That means the synchronization has to happen at the CPU level, between kernel launches, which is irritating and tedious... but that's what OpenCL provides.

The primary change here is to sort() and sort_index(); I had to add a few new kernels to split the different radix sort operations into steps.
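
To make the "split into steps" idea concrete, here is a minimal CPU-only sketch of a radix sort where each pass is broken into three separate functions, with the host loop acting as the synchronization point between them. This is not Bandicoot's actual kernel code: the function names (`count_digits`, `exclusive_scan`, `scatter`, `radix_sort_steps`), the 8-bit digit width, and the 32-bit unsigned key type are all assumptions chosen for illustration. In an OpenCL setting, each function would roughly correspond to one kernel launch, and returning to the host between launches would stand in for the device-wide synchronization that is not available inside a kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical parameters for this sketch: 8-bit digits, so 256 buckets.
constexpr unsigned kRadixBits = 8;
constexpr std::size_t kBuckets = std::size_t{1} << kRadixBits;
using Buckets = std::array<std::size_t, kBuckets>;

// Step 1 (would be one kernel): count how many keys fall into each bucket
// for the digit selected by `shift`.
Buckets count_digits(const std::vector<std::uint32_t>& keys, unsigned shift) {
  Buckets counts{};
  for (std::uint32_t k : keys) {
    ++counts[(k >> shift) & (kBuckets - 1)];
  }
  return counts;
}

// Step 2 (would be another kernel): an exclusive prefix sum turns the
// per-bucket counts into the starting output offset of each bucket.
void exclusive_scan(Buckets& counts) {
  std::size_t running = 0;
  for (std::size_t& c : counts) {
    const std::size_t n = c;
    c = running;
    running += n;
  }
}

// Step 3 (would be a third kernel): move each key to its bucket's next free
// slot. Scanning the input in order keeps the pass stable.
void scatter(const std::vector<std::uint32_t>& in,
             std::vector<std::uint32_t>& out,
             Buckets offsets, unsigned shift) {
  assert(in.size() == out.size());
  for (std::uint32_t k : in) {
    out[offsets[(k >> shift) & (kBuckets - 1)]++] = k;
  }
}

// Host-side driver: each step finishes completely before the next starts,
// which is the role the CPU-level synchronization plays between kernels.
std::vector<std::uint32_t> radix_sort_steps(std::vector<std::uint32_t> keys) {
  std::vector<std::uint32_t> scratch(keys.size());
  for (unsigned shift = 0; shift < 32; shift += kRadixBits) {
    Buckets offsets = count_digits(keys, shift);
    exclusive_scan(offsets);
    scatter(keys, scratch, offsets, shift);
    std::swap(keys, scratch);
  }
  return keys;
}
```

The cost of this structure is one round trip to the host per step, which is the "irritating and tedious" part; the benefit is that no step ever needs to wait on work-items outside its own launch.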

I'll review this deeply and merge by the end of the week, then release.
