
Use OpenCL subgroups instead of "wavefronts"

Pretty much all of my development has been on nvidia GPUs up to this point, so it's not a huge surprise to me that a whole pile of tests failed when I finally got the Bandicoot tests running on my no-fp64-support Intel laptop GPU.

It turns out that during the process of implementing OpenCL support, I came across the idea of a "wavefront" and thought it was the same as a CUDA "warp": a block of threads that runs in lock-step, where you can do synchronous (barrier-free) programming for speed*. However, "wavefront" is an AMD GPU concept, not an OpenCL concept. The analogous OpenCL concept is called a "subgroup", and it was not part of the OpenCL 1.2 standard: in OpenCL 1.2 it was only available through an Intel extension (cl_intel_subgroups), and in OpenCL 2.0 only through a Khronos extension (cl_khr_subgroups). It only became core in OpenCL 2.1.
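
On the kernel side, the difference is mostly just which pragma exposes the built-ins. Here's an illustrative fragment (not Bandicoot code); you would enable only the pragma the device actually reports:

```c
// Illustrative probe kernel.  On pre-2.1 devices the subgroup built-ins
// must be enabled via whichever extension the device reports; from
// OpenCL 2.1 onwards they are core and no pragma is needed.
#pragma OPENCL EXTENSION cl_intel_subgroups : enable   // OpenCL 1.2 (Intel)
// #pragma OPENCL EXTENSION cl_khr_subgroups : enable  // OpenCL 2.0 (Khronos)

__kernel void subgroup_probe(__global uint* out)
  {
  // Both extensions provide the same basic built-ins.
  if (get_sub_group_id() == 0 && get_sub_group_local_id() == 0)
    {
    // Record the subgroup size seen by each workgroup.
    out[get_group_id(0)] = get_sub_group_size();
    }
  }
```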

Anyway, I added support to the opencl::runtime_t to detect the subgroup size. If subgroups are not available, the subgroup size is taken to be 0, and any subgroup barriers (which are cheap) become expensive full-workgroup barriers. That's slower, but better than returning incorrect results... and I think the vast majority of OpenCL devices support subgroups anyway, so I'm not too worried about that fallback.
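
The actual runtime_t changes aren't worth pasting here, but conceptually the detection looks something like the following sketch (the function name and structure are mine, not Bandicoot's). One way to wire it up, and what the kernel sketch further down assumes, is to bake the detected size into JIT compilation as a -D SUBGROUP_SIZE=... define:

```cpp
// Conceptual sketch only.  Returns the subgroup size for `kernel` at
// work-group size `wg_size`, or 0 if subgroups are unavailable (in which
// case kernels fall back to full workgroup barriers).
#define CL_TARGET_OPENCL_VERSION 210
#include <CL/cl.h>
#include <string>

size_t query_subgroup_size(cl_device_id dev, cl_kernel kernel, size_t wg_size)
  {
  // Look for either extension; from OpenCL 2.1 onwards subgroups are core,
  // so a real implementation would check the device version too.
  size_t ext_len = 0;
  clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, NULL, &ext_len);
  std::string exts(ext_len, '\0');
  clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, ext_len, &exts[0], NULL);

  if (exts.find("cl_khr_subgroups") == std::string::npos &&
      exts.find("cl_intel_subgroups") == std::string::npos)
    {
    return 0;  // no subgroup support at all
    }

  // Ask how large a subgroup would be for a 1-D launch of size wg_size.
  // (On OpenCL 2.0 this entry point is clGetKernelSubGroupInfoKHR,
  // fetched via clGetExtensionFunctionAddressForPlatform.)
  size_t sg_size = 0;
  if (clGetKernelSubGroupInfo(kernel, dev,
                              CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE,
                              sizeof(size_t), &wg_size,
                              sizeof(size_t), &sg_size, NULL) != CL_SUCCESS)
    {
    return 0;  // treat a failed query the same as "no subgroups"
    }
  return sg_size;
  }
```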

I also had to adapt basically every reduce kernel (easy enough, really).
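
For illustration, here is a minimal sketch of the adapted pattern, not the exact Bandicoot kernel. It assumes SUBGROUP_SIZE is injected as a JIT-time define (0 when subgroups are unavailable) and that subgroups map to consecutive local IDs, which is the typical layout:

```c
#if SUBGROUP_SIZE > 0
  // On OpenCL 1.2 Intel devices this would be cl_intel_subgroups instead.
  #pragma OPENCL EXTENSION cl_khr_subgroups : enable
#endif

__kernel void sum_reduce(__global const float* in,
                         __global float* out,
                         __local float* aux,
                         const uint n)
  {
  const size_t tid = get_local_id(0);
  const size_t gid = get_global_id(0);

  aux[tid] = (gid < n) ? in[gid] : 0.0f;
  barrier(CLK_LOCAL_MEM_FENCE);

  for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1)
    {
    if (tid < s) { aux[tid] += aux[tid + s]; }

    #if SUBGROUP_SIZE > 0
    if (s > SUBGROUP_SIZE)
      {
      barrier(CLK_LOCAL_MEM_FENCE);            // several subgroups still active
      }
    else
      {
      sub_group_barrier(CLK_LOCAL_MEM_FENCE);  // remaining work fits in subgroup 0
      }
    #else
    barrier(CLK_LOCAL_MEM_FENCE);              // fallback: full workgroup barrier
    #endif
    }

  if (tid == 0) { out[get_group_id(0)] = aux[0]; }
  }
```

The idea is that full workgroup barriers are only needed while more than one subgroup still holds active threads; once the stride fits inside a single subgroup, the cheap subgroup barrier (or, when SUBGROUP_SIZE is 0, a plain workgroup barrier) is enough.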

My poor Intel GPU might not be the best device to run Bandicoot on. Here are the raw results for 5 trials of fp32 dot products (runtime in seconds, bandwidth in GB/s):

benchmark, name, device, type, n_rows, n_cols, trial, runtime, bandwidth
dot, intel_iris_xe, cpu, float, 1000000, 1, 0, 0.000324703, 22.9458
dot, intel_iris_xe, cpu, float, 1000000, 1, 1, 0.000244469, 30.4766
dot, intel_iris_xe, cpu, float, 1000000, 1, 2, 0.000173057, 43.0528
dot, intel_iris_xe, cpu, float, 1000000, 1, 3, 0.000172687, 43.145
dot, intel_iris_xe, cpu, float, 1000000, 1, 4, 0.000163628, 45.5337
dot, intel_iris_xe, opencl, float, 1000000, 1, 0, 0.00147935, 5.03639
dot, intel_iris_xe, opencl, float, 1000000, 1, 1, 0.00144325, 5.16235
dot, intel_iris_xe, opencl, float, 1000000, 1, 2, 0.00133522, 5.58002
dot, intel_iris_xe, opencl, float, 1000000, 1, 3, 0.00137433, 5.42123
dot, intel_iris_xe, opencl, float, 1000000, 1, 4, 0.00122272, 6.09344
  • It turns out nvidia declared warp-synchronous programming unsafe about eight years ago, when independent thread scheduling was introduced, so I think I need to go add some warp-level barriers (__syncwarp()) to those kernels too, to make them safe. I want to see if that has any ill runtime effects; probably not in most cases. (If the GPU actually runs warps synchronously, the barriers can be compiled out.) A sketch of the change is below.
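
The fix nvidia's guidance describes splits each step of the old warp-synchronous reduction tail into a read and a write, with __syncwarp() between them. This is a minimal sketch under the assumptions noted in the comments, not Bandicoot's actual kernel:

```cuda
// Sketch: the final-warp tail of a shared-memory tree reduction, made safe
// under independent thread scheduling.  Assumes blockDim.x >= 64, that sdata
// has already been reduced down to 64 partial sums, and that exactly the 32
// threads with tid < 32 (one full warp) call this.
__device__ float warp_reduce_tail(float* sdata, unsigned int tid)
  {
  // __syncwarp() separates the read phase from the write phase of every
  // step, so no lane reads a slot another lane is concurrently writing.
  float v = sdata[tid] + sdata[tid + 32];
  __syncwarp();
  sdata[tid] = v;
  __syncwarp();

  for (unsigned int s = 16; s > 0; s >>= 1)
    {
    v += sdata[tid + s];
    __syncwarp();
    sdata[tid] = v;
    __syncwarp();
    }

  return v;  // lane 0 now holds the block's total
  }
```

On hardware that really does execute warps in lock-step, these barriers should cost essentially nothing.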
