
Better RNGs: randu, randi, randn

Ryan Curtin requested to merge rcurtin/bandicoot-code:better-rng into unstable

I built on @zoq's work with the Philox generator to implement support for randn(), randu(), and randi(). I didn't do randg() because there's actually a lot of complexity there (rejection sampling is involved), and I think we can leave that for later.

I'll let this sit for a couple of days to see if there are any comments (please don't feel obligated), then I'll merge it.

Shortlist of changes:

  • Use cuRand for randu() with the CUDA backend.
  • Use cuRand for randn() with the CUDA backend.
  • Add inplace_mod_scalar kernel to perform integer modulus.
  • Use cuRand plus some modulo and transformations for randi() with the CUDA backend.
  • Implement XORWOW-32 and XORWOW-64 RNGs for randu() with the OpenCL backend.
  • Adapt @zoq's Philox CUDA implementation to OpenCL for randn().
  • Use XORWOW RNGs plus a custom kernel for randi() with the OpenCL backend.
  • Add/formalize the concept of "zeroway" kernels, which are specific to an element type. This includes the OpenCL utilities for random number generation.
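For anyone unfamiliar with XORWOW, here is a minimal host-side C++ sketch of the 32-bit generator as published by Marsaglia (the default state constants are his). The struct name `xorwow32`, the `next_uniform()` helper, and the bits-to-float conversion are illustrative assumptions, not the kernel code in this MR; on the device, each work item would presumably carry its own copy of the state.

```cpp
#include <cstdint>

// Marsaglia's xorwow: a 160-bit xorshift state (x, y, z, w, v) plus a
// Weyl counter d. Names here are illustrative, not the MR's kernel code.
struct xorwow32
{
  std::uint32_t x = 123456789u;
  std::uint32_t y = 362436069u;
  std::uint32_t z = 521288629u;
  std::uint32_t w = 88675123u;
  std::uint32_t v = 5783321u;
  std::uint32_t d = 6615241u;

  // Advance the state and return 32 raw bits.
  std::uint32_t next()
  {
    const std::uint32_t t = x ^ (x >> 2);
    x = y; y = z; z = w; w = v;
    v = (v ^ (v << 4)) ^ (t ^ (t << 1));
    d += 362437u;
    return d + v;
  }

  // Map the top 24 bits to a float in [0, 1). 24 bits fit exactly in a
  // float mantissa, so the result can never round up to 1.0f.
  float next_uniform()
  {
    return static_cast<float>(next() >> 8) * (1.0f / 16777216.0f);
  }
};
```

The whole update is a handful of shifts, XORs, and one addition, which is consistent with the generator being cheap enough to run per element.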

Now, some benchmarks (raw output from benchmark programs randu, randn, and randi), all for 5000x5000 matrices:

randu

task, device, backend, element type, rows, cols, trial number, time

randu, rtx2080ti, cpu, float, 5000, 5000, 0, 0.314664
randu, rtx2080ti, cpu, float, 5000, 5000, 1, 0.323283
randu, rtx2080ti, cpu, float, 5000, 5000, 2, 0.314924
randu, rtx2080ti, cpu, float, 5000, 5000, 3, 0.314201
randu, rtx2080ti, cpu, float, 5000, 5000, 4, 0.31634
randu, rtx2080ti, opencl, float, 5000, 5000, 0, 0.000599097
randu, rtx2080ti, opencl, float, 5000, 5000, 1, 0.000389761
randu, rtx2080ti, opencl, float, 5000, 5000, 2, 0.000382144
randu, rtx2080ti, opencl, float, 5000, 5000, 3, 0.000384653
randu, rtx2080ti, opencl, float, 5000, 5000, 4, 0.00037674
randu, rtx2080ti, cuda, float, 5000, 5000, 0, 0.000526815
randu, rtx2080ti, cuda, float, 5000, 5000, 1, 0.000279331
randu, rtx2080ti, cuda, float, 5000, 5000, 2, 0.000269897
randu, rtx2080ti, cuda, float, 5000, 5000, 3, 0.000276838
randu, rtx2080ti, cuda, float, 5000, 5000, 4, 0.00074676
randu, rtx2080ti, cpu, double, 5000, 5000, 0, 0.308204
randu, rtx2080ti, cpu, double, 5000, 5000, 1, 0.308571
randu, rtx2080ti, cpu, double, 5000, 5000, 2, 0.308298
randu, rtx2080ti, cpu, double, 5000, 5000, 3, 0.305397
randu, rtx2080ti, cpu, double, 5000, 5000, 4, 0.307356
randu, rtx2080ti, opencl, double, 5000, 5000, 0, 0.00120569
randu, rtx2080ti, opencl, double, 5000, 5000, 1, 0.000844121
randu, rtx2080ti, opencl, double, 5000, 5000, 2, 0.000816368
randu, rtx2080ti, opencl, double, 5000, 5000, 3, 0.000815082
randu, rtx2080ti, opencl, double, 5000, 5000, 4, 0.000827565
randu, rtx2080ti, cuda, double, 5000, 5000, 0, 0.00082016
randu, rtx2080ti, cuda, double, 5000, 5000, 1, 0.000706266
randu, rtx2080ti, cuda, double, 5000, 5000, 2, 0.00070434
randu, rtx2080ti, cuda, double, 5000, 5000, 3, 0.000703828
randu, rtx2080ti, cuda, double, 5000, 5000, 4, 0.00121331
randu, rtx2080ti, cpu, u32, 5000, 5000, 0, 0.311288
randu, rtx2080ti, cpu, u32, 5000, 5000, 1, 0.309561
randu, rtx2080ti, cpu, u32, 5000, 5000, 2, 0.309505
randu, rtx2080ti, cpu, u32, 5000, 5000, 3, 0.310701
randu, rtx2080ti, cpu, u32, 5000, 5000, 4, 0.310553
randu, rtx2080ti, opencl, u32, 5000, 5000, 0, 0.00104401
randu, rtx2080ti, opencl, u32, 5000, 5000, 1, 0.000369118
randu, rtx2080ti, opencl, u32, 5000, 5000, 2, 0.00036775
randu, rtx2080ti, opencl, u32, 5000, 5000, 3, 0.000416463
randu, rtx2080ti, opencl, u32, 5000, 5000, 4, 0.000362726
randu, rtx2080ti, cuda, u32, 5000, 5000, 0, 0.000789701
randu, rtx2080ti, cuda, u32, 5000, 5000, 1, 0.000679029
randu, rtx2080ti, cuda, u32, 5000, 5000, 2, 0.000679865
randu, rtx2080ti, cuda, u32, 5000, 5000, 3, 0.000676997
randu, rtx2080ti, cuda, u32, 5000, 5000, 4, 0.00130112
randu, rtx2080ti, cpu, u64, 5000, 5000, 0, 0.324781
randu, rtx2080ti, cpu, u64, 5000, 5000, 1, 0.31846
randu, rtx2080ti, cpu, u64, 5000, 5000, 2, 0.314874
randu, rtx2080ti, cpu, u64, 5000, 5000, 3, 0.315849
randu, rtx2080ti, cpu, u64, 5000, 5000, 4, 0.316885
randu, rtx2080ti, opencl, u64, 5000, 5000, 0, 0.00108205
randu, rtx2080ti, opencl, u64, 5000, 5000, 1, 0.00141024
randu, rtx2080ti, opencl, u64, 5000, 5000, 2, 0.000784686
randu, rtx2080ti, opencl, u64, 5000, 5000, 3, 0.000762174
randu, rtx2080ti, opencl, u64, 5000, 5000, 4, 0.000763266
randu, rtx2080ti, cuda, u64, 5000, 5000, 0, 0.00164102
randu, rtx2080ti, cuda, u64, 5000, 5000, 1, 0.00152981
randu, rtx2080ti, cuda, u64, 5000, 5000, 2, 0.00151395
randu, rtx2080ti, cuda, u64, 5000, 5000, 3, 0.0021971
randu, rtx2080ti, cuda, u64, 5000, 5000, 4, 0.00162329

Short story:

  • ~3 orders of magnitude faster than the CPU (about 800x for OpenCL float); this in part has to do with the extreme simplicity of the XORWOW generator.
  • OpenCL float/double performance is just a little slower than cuRand (I'm happy with that!).
  • OpenCL is faster than CUDA for integer types, simply because the cuRand implementation only produces floats/doubles, which must then be converted to integer types in a second pass.
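To make the speedup figures easy to check against the raw output, here is a small C++ sketch that takes the median of the five trials and reports the base-10 log of the ratio. The helper names `median` and `orders_of_magnitude` are hypothetical and not part of the benchmark programs. For the randu float rows above (CPU median 0.314924, OpenCL median 0.000384653) the ratio is about 819x, i.e. about 2.9 orders of magnitude.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty set of trial times (taken by value so the
// caller's data is not reordered).
double median(std::vector<double> times)
{
  std::sort(times.begin(), times.end());
  const std::size_t mid = times.size() / 2;
  if (times.size() % 2 == 1)
    return times[mid];
  return 0.5 * (times[mid - 1] + times[mid]);
}

// log10 of the speedup of `candidate` over `baseline`; both are
// positive times in the same unit.
double orders_of_magnitude(double baseline, double candidate)
{
  return std::log10(baseline / candidate);
}

// Example with the randu float rows:
//   const double cpu = median({0.314664, 0.323283, 0.314924, 0.314201, 0.31634});
//   const double ocl = median({0.000599097, 0.000389761, 0.000382144,
//                              0.000384653, 0.00037674});
//   orders_of_magnitude(cpu, ocl);
```

Using the median rather than the mean keeps the slower first trial of each GPU run (likely warm-up) from skewing the number.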

randn

task, device, backend, element type, rows, cols, trial number, time

randn, rtx2080ti, cpu, float, 5000, 5000, 0, 0.591816
randn, rtx2080ti, cpu, float, 5000, 5000, 1, 0.583793
randn, rtx2080ti, cpu, float, 5000, 5000, 2, 0.610927
randn, rtx2080ti, cpu, float, 5000, 5000, 3, 0.598189
randn, rtx2080ti, cpu, float, 5000, 5000, 4, 0.58282
randn, rtx2080ti, opencl, float, 5000, 5000, 0, 0.00158159
randn, rtx2080ti, opencl, float, 5000, 5000, 1, 0.00135531
randn, rtx2080ti, opencl, float, 5000, 5000, 2, 0.00144557
randn, rtx2080ti, opencl, float, 5000, 5000, 3, 0.00145623
randn, rtx2080ti, opencl, float, 5000, 5000, 4, 0.00133763
randn, rtx2080ti, cuda, float, 5000, 5000, 0, 0.000508964
randn, rtx2080ti, cuda, float, 5000, 5000, 1, 0.000290162
randn, rtx2080ti, cuda, float, 5000, 5000, 2, 0.000273993
randn, rtx2080ti, cuda, float, 5000, 5000, 3, 0.000284947
randn, rtx2080ti, cuda, float, 5000, 5000, 4, 0.000278929
randn, rtx2080ti, cpu, double, 5000, 5000, 0, 0.595662
randn, rtx2080ti, cpu, double, 5000, 5000, 1, 0.575276
randn, rtx2080ti, cpu, double, 5000, 5000, 2, 0.600488
randn, rtx2080ti, cpu, double, 5000, 5000, 3, 0.593243
randn, rtx2080ti, cpu, double, 5000, 5000, 4, 0.579955
randn, rtx2080ti, opencl, double, 5000, 5000, 0, 0.0307226
randn, rtx2080ti, opencl, double, 5000, 5000, 1, 0.0304636
randn, rtx2080ti, opencl, double, 5000, 5000, 2, 0.0293387
randn, rtx2080ti, opencl, double, 5000, 5000, 3, 0.0291127
randn, rtx2080ti, opencl, double, 5000, 5000, 4, 0.0233058
randn, rtx2080ti, cuda, double, 5000, 5000, 0, 0.00536012
randn, rtx2080ti, cuda, double, 5000, 5000, 1, 0.00425261
randn, rtx2080ti, cuda, double, 5000, 5000, 2, 0.00530492
randn, rtx2080ti, cuda, double, 5000, 5000, 3, 0.00676508
randn, rtx2080ti, cuda, double, 5000, 5000, 4, 0.00429673
randn, rtx2080ti, cpu, u32, 5000, 5000, 0, 0.592647
randn, rtx2080ti, cpu, u32, 5000, 5000, 1, 0.589459
randn, rtx2080ti, cpu, u32, 5000, 5000, 2, 0.597422
randn, rtx2080ti, cpu, u32, 5000, 5000, 3, 0.589147
randn, rtx2080ti, cpu, u32, 5000, 5000, 4, 0.59219
randn, rtx2080ti, opencl, u32, 5000, 5000, 0, 0.00168049
randn, rtx2080ti, opencl, u32, 5000, 5000, 1, 0.00140178
randn, rtx2080ti, opencl, u32, 5000, 5000, 2, 0.00137218
randn, rtx2080ti, opencl, u32, 5000, 5000, 3, 0.00136348
randn, rtx2080ti, opencl, u32, 5000, 5000, 4, 0.00138486
randn, rtx2080ti, cuda, u32, 5000, 5000, 0, 0.000806589
randn, rtx2080ti, cuda, u32, 5000, 5000, 1, 0.000719439
randn, rtx2080ti, cuda, u32, 5000, 5000, 2, 0.000695759
randn, rtx2080ti, cuda, u32, 5000, 5000, 3, 0.000696068
randn, rtx2080ti, cuda, u32, 5000, 5000, 4, 0.000691295
randn, rtx2080ti, cpu, u64, 5000, 5000, 0, 0.609566
randn, rtx2080ti, cpu, u64, 5000, 5000, 1, 0.589365
randn, rtx2080ti, cpu, u64, 5000, 5000, 2, 0.608482
randn, rtx2080ti, cpu, u64, 5000, 5000, 3, 0.601375
randn, rtx2080ti, cpu, u64, 5000, 5000, 4, 0.612364
randn, rtx2080ti, opencl, u64, 5000, 5000, 0, 0.00237658
randn, rtx2080ti, opencl, u64, 5000, 5000, 1, 0.0020103
randn, rtx2080ti, opencl, u64, 5000, 5000, 2, 0.00208289
randn, rtx2080ti, opencl, u64, 5000, 5000, 3, 0.00201645
randn, rtx2080ti, opencl, u64, 5000, 5000, 4, 0.00201588
randn, rtx2080ti, cuda, u64, 5000, 5000, 0, 0.0122917
randn, rtx2080ti, cuda, u64, 5000, 5000, 1, 0.00904451
randn, rtx2080ti, cuda, u64, 5000, 5000, 2, 0.00748378
randn, rtx2080ti, cuda, u64, 5000, 5000, 3, 0.0121116
randn, rtx2080ti, cuda, u64, 5000, 5000, 4, 0.00861569

Short story:

  • 2-3 orders of magnitude faster than the CPU, in part because we are using a simpler RNG than the Mersenne twister.
  • OpenCL float/double performance is slower than the cuRand implementation (roughly 5x in these runs), but not that far off.
  • For CUDA integer types, there are extra passes over the matrix we have to do to convert back to the integer type, so it's slower than CUDA float/double (and, for u64, slower than OpenCL too).
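This description does not spell out how uniform draws are turned into normal ones. As a rough illustration only, the C++ sketch below uses the Box-Muller transform, which is one standard way to do it; treating it as the method used here is an assumption, and the function names are hypothetical. It also shows why randn() costs more per element than randu(): every pair of outputs needs a log, a sqrt, a sin, and a cos on top of the raw bits.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Map 32 raw bits to (0, 1]; excluding 0 keeps log() finite.
inline double bits_to_open_unit(std::uint32_t bits)
{
  return (bits + 1.0) / 4294967296.0;
}

// Map 32 raw bits to [0, 1).
inline double bits_to_unit(std::uint32_t bits)
{
  return bits / 4294967296.0;
}

// Box-Muller: two independent uniforms in, two independent standard
// normals out. u1 must lie in (0, 1], u2 in [0, 1).
inline std::pair<double, double> box_muller(double u1, double u2)
{
  const double two_pi = 6.283185307179586;
  const double r = std::sqrt(-2.0 * std::log(u1));
  return { r * std::cos(two_pi * u2), r * std::sin(two_pi * u2) };
}
```

Filling an integer matrix from such a generator would then need a conversion from the floating-point result to the integer type, either inside the same kernel or as a separate pass.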

randi (in range [100, 500])

task, device, backend, element type, rows, cols, trial number, time

randi, rtx2080ti, cpu, float, 5000, 5000, 0, 0.194668
randi, rtx2080ti, cpu, float, 5000, 5000, 1, 0.195344
randi, rtx2080ti, cpu, float, 5000, 5000, 2, 0.199487
randi, rtx2080ti, cpu, float, 5000, 5000, 3, 0.196128
randi, rtx2080ti, cpu, float, 5000, 5000, 4, 0.195068
randi, rtx2080ti, opencl, float, 5000, 5000, 0, 0.000675889
randi, rtx2080ti, opencl, float, 5000, 5000, 1, 0.000422235
randi, rtx2080ti, opencl, float, 5000, 5000, 2, 0.000411852
randi, rtx2080ti, opencl, float, 5000, 5000, 3, 0.000421516
randi, rtx2080ti, opencl, float, 5000, 5000, 4, 0.000419949
randi, rtx2080ti, cuda, float, 5000, 5000, 0, 0.00182007
randi, rtx2080ti, cuda, float, 5000, 5000, 1, 0.0023648
randi, rtx2080ti, cuda, float, 5000, 5000, 2, 0.00159865
randi, rtx2080ti, cuda, float, 5000, 5000, 3, 0.00158884
randi, rtx2080ti, cuda, float, 5000, 5000, 4, 0.00220668
randi, rtx2080ti, cpu, double, 5000, 5000, 0, 0.205512
randi, rtx2080ti, cpu, double, 5000, 5000, 1, 0.204601
randi, rtx2080ti, cpu, double, 5000, 5000, 2, 0.204348
randi, rtx2080ti, cpu, double, 5000, 5000, 3, 0.206919
randi, rtx2080ti, cpu, double, 5000, 5000, 4, 0.205936
randi, rtx2080ti, opencl, double, 5000, 5000, 0, 0.00210586
randi, rtx2080ti, opencl, double, 5000, 5000, 1, 0.00260872
randi, rtx2080ti, opencl, double, 5000, 5000, 2, 0.00250073
randi, rtx2080ti, opencl, double, 5000, 5000, 3, 0.00243068
randi, rtx2080ti, opencl, double, 5000, 5000, 4, 0.00243036
randi, rtx2080ti, cuda, double, 5000, 5000, 0, 0.00581644
randi, rtx2080ti, cuda, double, 5000, 5000, 1, 0.00319957
randi, rtx2080ti, cuda, double, 5000, 5000, 2, 0.00316276
randi, rtx2080ti, cuda, double, 5000, 5000, 3, 0.00499568
randi, rtx2080ti, cuda, double, 5000, 5000, 4, 0.00335625
randi, rtx2080ti, cpu, u32, 5000, 5000, 0, 0.195196
randi, rtx2080ti, cpu, u32, 5000, 5000, 1, 0.198511
randi, rtx2080ti, cpu, u32, 5000, 5000, 2, 0.195301
randi, rtx2080ti, cpu, u32, 5000, 5000, 3, 0.193952
randi, rtx2080ti, cpu, u32, 5000, 5000, 4, 0.193012
randi, rtx2080ti, opencl, u32, 5000, 5000, 0, 0.000853076
randi, rtx2080ti, opencl, u32, 5000, 5000, 1, 0.00137391
randi, rtx2080ti, opencl, u32, 5000, 5000, 2, 0.00138221
randi, rtx2080ti, opencl, u32, 5000, 5000, 3, 0.00138316
randi, rtx2080ti, opencl, u32, 5000, 5000, 4, 0.00138459
randi, rtx2080ti, cuda, u32, 5000, 5000, 0, 0.00130594
randi, rtx2080ti, cuda, u32, 5000, 5000, 1, 0.00117233
randi, rtx2080ti, cuda, u32, 5000, 5000, 2, 0.0019279
randi, rtx2080ti, cuda, u32, 5000, 5000, 3, 0.00172964
randi, rtx2080ti, cuda, u32, 5000, 5000, 4, 0.00127995
randi, rtx2080ti, cpu, u64, 5000, 5000, 0, 0.205937
randi, rtx2080ti, cpu, u64, 5000, 5000, 1, 0.205542
randi, rtx2080ti, cpu, u64, 5000, 5000, 2, 0.208592
randi, rtx2080ti, cpu, u64, 5000, 5000, 3, 0.204521
randi, rtx2080ti, cpu, u64, 5000, 5000, 4, 0.206101
randi, rtx2080ti, opencl, u64, 5000, 5000, 0, 0.00179267
randi, rtx2080ti, opencl, u64, 5000, 5000, 1, 0.00283525
randi, rtx2080ti, opencl, u64, 5000, 5000, 2, 0.00222009
randi, rtx2080ti, opencl, u64, 5000, 5000, 3, 0.0021565
randi, rtx2080ti, opencl, u64, 5000, 5000, 4, 0.00215735
randi, rtx2080ti, cuda, u64, 5000, 5000, 0, 0.00425219
randi, rtx2080ti, cuda, u64, 5000, 5000, 1, 0.00234799
randi, rtx2080ti, cuda, u64, 5000, 5000, 2, 0.00231953
randi, rtx2080ti, cuda, u64, 5000, 5000, 3, 0.00430835
randi, rtx2080ti, cuda, u64, 5000, 5000, 4, 0.00840221

Short story:

  • Interesting that CPU randi() is faster than both randu() and randn(). I didn't look into that at all.
  • Similar 2-3 order of magnitude speedup for the GPU implementations, likely for the same reasons.
  • CUDA is not faster than OpenCL for any element type here, and for float it is roughly 4x slower; this is almost certainly because we take up to three passes over the matrix for randi(), instead of writing custom kernels.
  • OpenCL is as fast or faster for every element type, since it's all done in one pass.
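To illustrate the pass-count difference, here is a host-side C++ sketch of both strategies. Each loop in the multi-pass version stands in for a separate kernel launch over the whole matrix; the exact split into passes, the function names, and the use of `std::mt19937` as a stand-in for the device RNG are all assumptions for illustration, not the code in this MR.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Multi-pass: generate, then modulus in place, then shift.
// Assumes lo <= hi and hi - lo + 1 does not overflow 32 bits.
std::vector<std::uint32_t> randi_multi_pass(std::size_t n, std::uint32_t lo,
                                            std::uint32_t hi, std::uint32_t seed)
{
  std::mt19937 gen(seed);  // stand-in for the device generator
  std::vector<std::uint32_t> out(n);
  const std::uint32_t range = hi - lo + 1;
  for (auto& v : out) v = static_cast<std::uint32_t>(gen());  // pass 1: raw bits
  for (auto& v : out) v %= range;                             // pass 2: in-place modulus
  for (auto& v : out) v += lo;                                // pass 3: shift into [lo, hi]
  return out;
}

// Single pass: each element is generated, reduced, and shifted in one go,
// so memory is only written once.
std::vector<std::uint32_t> randi_single_pass(std::size_t n, std::uint32_t lo,
                                             std::uint32_t hi, std::uint32_t seed)
{
  std::mt19937 gen(seed);  // stand-in for the device generator
  std::vector<std::uint32_t> out(n);
  const std::uint32_t range = hi - lo + 1;
  for (auto& v : out) v = lo + static_cast<std::uint32_t>(gen()) % range;
  return out;
}
```

Both versions produce the same values for the same seed; the only difference is how many times the matrix is read and written, which matters because these operations are memory-bound. Note that a plain modulus is very slightly biased whenever the range does not divide 2^32 evenly, as with the 401 values in [100, 500].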
