
Support for subgroups on NVIDIA OpenCL devices

I was surprised to find that my NVIDIA OpenCL driver does not support subgroups at all, even though they are part of the OpenCL standard. NVIDIA devices still execute in warps, though, and the NVIDIA OpenCL programming guide even recommends warp-synchronous programming without memory fences (so long as the memory is marked volatile). So, rather than disabling subgroup support and forcing more expensive barrier(CLK_LOCAL_MEM_FENCE) calls throughout, this code checks whether we are on an NVIDIA device, queries the warp size, and uses that as the subgroup size. Subgroup barriers then become no-ops.
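For reviewers, here is a minimal sketch of the host-side decision described above. It is not the code in this merge request. The function names and the `SUBGROUP_SIZE` / `SUBGROUP_BARRIER_NOOP` macros are hypothetical, chosen only for illustration. It assumes the vendor and extension strings come from `clGetDeviceInfo` with `CL_DEVICE_VENDOR` and `CL_DEVICE_EXTENSIONS`, and that the warp size comes from `CL_DEVICE_WARP_SIZE_NV` (part of the `cl_nv_device_attribute_query` extension). The OpenCL calls themselves are left out so the logic stands alone.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether to emulate subgroups with warps.
 *
 *   vendor       - string reported for CL_DEVICE_VENDOR
 *   extensions   - string reported for CL_DEVICE_EXTENSIONS
 *   nv_warp_size - value reported for CL_DEVICE_WARP_SIZE_NV
 *                  (only meaningful on NVIDIA devices)
 *
 * Returns the warp size to use as the subgroup size, or 0 when the
 * fallback does not apply (native subgroups exist, or not NVIDIA).
 */
unsigned warp_fallback_size(const char *vendor, const char *extensions,
                            unsigned nv_warp_size)
{
    /* The driver already exposes subgroups: no emulation needed. */
    if (strstr(extensions, "cl_khr_subgroups") != NULL)
        return 0;

    /* Only trust the warp size on an NVIDIA device that can report it. */
    if (strstr(vendor, "NVIDIA") == NULL)
        return 0;
    if (strstr(extensions, "cl_nv_device_attribute_query") == NULL)
        return 0;

    return nv_warp_size;
}

/*
 * Hypothetical helper: format build options for clBuildProgram.
 * With a nonzero fallback size, kernels could compile their subgroup
 * barrier to nothing; with 0 they would keep using
 * barrier(CLK_LOCAL_MEM_FENCE).
 * Returns the value of snprintf (characters needed, excluding NUL).
 */
int format_subgroup_options(unsigned fallback_size, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "-DSUBGROUP_SIZE=%u -DSUBGROUP_BARRIER_NOOP=%d",
                    fallback_size, fallback_size > 0 ? 1 : 0);
}
```

Passing the result as preprocessor definitions keeps the kernel source identical across vendors. Only the build options change, so the no-op barrier is decided at kernel compile time instead of being branched on at run time.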

I expected a noticeable speedup in the OpenCL accu benchmark, but I did not see one! Perhaps barrier(CLK_LOCAL_MEM_FENCE) is simply not that expensive.

In any case, I still think this change is necessary.
