exhaustFirstTile() triggers "CUDA Exception: Warp Illegal Address" when running apgluxe on an 80GB A100
I believe I found a bug in lifelib when running apgluxe on an A100 GPU with 80GB of memory. After recompiling with --cuda (gcc 8.5.0 / CUDA 12.1) and running apgluxe on a system with an NVIDIA A100 80GB, I get multiple "CUDA Error 700" messages:
Interesting universes: 0 out of 1000000
CUDA Error 700 : an illegal memory access was encountered
Interesting universes: 0 out of 1000000
CUDA Error 700 : an illegal memory access was encountered
b3s23/G1: 1000000 soups completed (200000000.000 soups/second current, 200000000.000 overall).
...and so on. The program crashes soon afterwards.
V100 and A40 GPU systems run fine, so I suspected a memory size issue. Debugging with cuda-gdb resulted in:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x155537467cd0
Thread 1 "apgluxe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 5, block (131507,0,0), thread (32,0,0), device 0, sm 0, warp 10, lane 0]
0x0000155537462e30 in exhaustFirstTile(unsigned int*, unsigned int*, uint4*)<<<(250000,1,1),(128,1,1)>>> ()
This looks like an overflow related to the first grid dimension (250000 in the trace above), which is derived from a variable called minibatch. That variable is initialized in cuda2/gs_impl.h depending on the reported memory size of the GPU.
Lowering the minibatch value for A100/H100 systems from 1000000 to 500000 makes the "CUDA Error 700" messages disappear, and apgluxe then runs without crashing.
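As a rough check of the numbers above, the following host-only C++ sketch redoes the grid arithmetic. It assumes that the grid's first dimension scales linearly with minibatch; the factor of 4 is inferred only from the trace (minibatch 1000000 gave 250000 blocks) and has not been checked against the lifelib source. The names `kSoupsPerBlock`, `grid_blocks`, `block_exists` and `global_thread_index` are hypothetical helpers for illustration, not lifelib identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: grid x-dimension = minibatch / 4, inferred from the trace
// (minibatch 1000000 -> <<<(250000,1,1),(128,1,1)>>>). Not taken from lifelib.
const std::uint64_t kSoupsPerBlock = 4;      // hypothetical name
const std::uint64_t kThreadsPerBlock = 128;  // block size shown by cuda-gdb

// Number of blocks launched for a given minibatch, under the assumption above.
inline std::uint64_t grid_blocks(std::uint64_t minibatch) {
    return minibatch / kSoupsPerBlock;
}

// True if a block index would be part of the grid for this minibatch.
inline bool block_exists(std::uint64_t block, std::uint64_t minibatch) {
    return block < grid_blocks(minibatch);
}

// Flat thread index (blockIdx.x * blockDim.x + threadIdx.x) for a 1-D launch.
inline std::uint64_t global_thread_index(std::uint64_t block, std::uint64_t thread) {
    assert(thread < kThreadsPerBlock);
    return block * kThreadsPerBlock + thread;
}
```

Under that assumption, halving minibatch to 500000 gives a grid of 125000 blocks, so the faulting block 131507 is no longer launched at all. This is consistent with the workaround but does not identify the root cause. It may also be worth noting that 131507 lies just above 2^17 = 131072, which corresponds to 2^24 threads at 128 threads per block. Whether a 32-bit offset computation is involved is not something the trace confirms.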
For the record, the output of nvidia-smi on said system:
[feverdij@gpu011 apgmera]$ nvidia-smi
Fri Jan 5 14:41:55 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:16:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+