Support for subgroups on nvidia OpenCL devices
I was surprised to find that my nvidia OpenCL driver simply does not support subgroups, even though they are part of the OpenCL standard. However, nvidia devices still have warps, and the nvidia OpenCL programming guide even recommends warp-synchronous programming without memory fences (so long as the memory is marked volatile). So, instead of disabling subgroup support and forcing more expensive barrier(CLK_LOCAL_MEM_FENCE) calls throughout, this code checks whether we are on an nvidia device, gets the warp size, and uses that as the subgroup size. Subgroup barriers become no-ops.
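As a rough sketch of the device check described above (not the actual code in this PR), the decision can be reduced to a small host-side helper that picks kernel build options. The function name subgroup_build_options and the macros USE_SUBGROUPS, EMULATE_SUBGROUPS, and SUBGROUP_SIZE are hypothetical names chosen for illustration. The sketch assumes the vendor string comes from clGetDeviceInfo with CL_DEVICE_VENDOR and the warp size from CL_DEVICE_WARP_SIZE_NV (part of the cl_nv_device_attribute_query extension); those queries are left out so the example compiles without an OpenCL SDK.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: choose the -D options passed to clBuildProgram.
 *
 * vendor        : assumed to be the CL_DEVICE_VENDOR string
 * has_subgroups : nonzero if the driver reports real subgroup support
 * warp_size     : assumed to be the CL_DEVICE_WARP_SIZE_NV value, 0 if unknown
 *
 * Returns the snprintf result (length of the option string). */
int subgroup_build_options(const char *vendor, int has_subgroups,
                           unsigned warp_size, char *out, size_t out_len)
{
    assert(vendor != NULL && out != NULL && out_len > 0);

    /* Real subgroup support: use it as-is. */
    if (has_subgroups) {
        return snprintf(out, out_len, "-DUSE_SUBGROUPS=1");
    }

    /* No subgroup support, but an nvidia device with a known warp size:
     * treat a warp as a subgroup. The kernel would then define its
     * subgroup barrier as a no-op and mark shared local memory volatile. */
    if (strstr(vendor, "NVIDIA") != NULL && warp_size > 0) {
        return snprintf(out, out_len,
                        "-DUSE_SUBGROUPS=1 -DEMULATE_SUBGROUPS=1 -DSUBGROUP_SIZE=%u",
                        warp_size);
    }

    /* Anything else: fall back to full barrier(CLK_LOCAL_MEM_FENCE) calls. */
    return snprintf(out, out_len, "-DUSE_SUBGROUPS=0");
}
```

On the kernel side, these hypothetical macros would select between a real subgroup barrier, an empty one (the warp case), and barrier(CLK_LOCAL_MEM_FENCE).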
I expected a noticeable speedup in the OpenCL accu benchmark, but I did not see one! I wonder if barrier(CLK_LOCAL_MEM_FENCE) is simply not that expensive.
In any case, I still think this change is necessary.