exhaustFirstTile() triggers "CUDA Exception: Warp Illegal Address" when running apgluxe on an 80GB A100

I believe I found a bug in lifelib when running apgluxe on an A100 GPU with 80GB. After recompiling with --cuda (gcc 8.5.0/cuda 12.1) and running apgluxe on a system with an NVIDIA A100 80GB, I get multiple "CUDA Error 700" messages:

Interesting universes: 0 out of 1000000
CUDA Error 700 : an illegal memory access was encountered
Interesting universes: 0 out of 1000000
CUDA Error 700 : an illegal memory access was encountered
b3s23/G1: 1000000 soups completed (200000000.000 soups/second current, 200000000.000 overall).

...and so on. The program crashes soon after that.

V100 and A40 GPU systems run fine, so I suspected a memory size issue. Debugging with cuda-gdb gave:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x155537467cd0

Thread 1 "apgluxe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 5, block (131507,0,0), thread (32,0,0), device 0, sm 0, warp 10, lane 0]
0x0000155537462e30 in exhaustFirstTile(unsigned int*, unsigned int*, uint4*)<<<(250000,1,1),(128,1,1)>>> ()

This looks like an overflow in the first grid dimension, which is derived from a variable called minibatch. That variable is initialized in cuda2/gs_impl.h depending on the reported memory size of the GPU.

Lowering the minibatch value for A100/H100 systems from 1000000 to 500000 makes the errors go away, and apgluxe runs.

For the record, the output of nvidia-smi on said system:

[feverdij@gpu011 apgmera]$ nvidia-smi
Fri Jan  5 14:41:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:16:00.0 Off |                    0 |
| N/A   31C    P0              41W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
Edited by Frank Everdij