Add support for CUDA
I've written an iterator for B23/S3 on CUDA. It obviously needs a lot of work units to keep the GPU busy, but it runs about 1000 times faster than LifeCube, which is the basis for your AVXLife.
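To make "work unit" concrete, here is a minimal CPU-side sketch in C++ of what one unit of work could look like: a single small tile advanced by one generation, plus a loop over a batch of independent tiles. This is not my CUDA code. The names `Rule`, `Tile`, `step_tile` and `step_batch` are made up for illustration, and cells outside the tile are assumed dead. On a GPU, each iteration of the batch loop would be a candidate for its own thread or block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rule description: bit n set means "n live neighbours".
// B23/S3 would be birth = (1<<2)|(1<<3) = 0x0C, survive = (1<<3) = 0x08.
struct Rule {
    std::uint16_t birth;
    std::uint16_t survive;
};

// Hypothetical tile layout: row-major, one byte per cell, 1 = alive.
using Tile = std::vector<std::uint8_t>;

// Advance one w x h tile by a single generation.
// Assumption: everything outside the tile is dead.
Tile step_tile(const Tile& cells, int w, int h, Rule rule) {
    assert(w > 0 && h > 0 && cells.size() == static_cast<std::size_t>(w) * h);
    Tile next(cells.size(), 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int n = 0;  // live neighbour count
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    const int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) n += cells[ny * w + nx];
                }
            }
            const std::uint16_t mask = cells[y * w + x] ? rule.survive : rule.birth;
            next[y * w + x] = static_cast<std::uint8_t>((mask >> n) & 1u);
        }
    }
    return next;
}

// Each tile is independent of the others, so this loop is the part
// that would be spread across GPU threads.
std::vector<Tile> step_batch(const std::vector<Tile>& tiles, int w, int h, Rule rule) {
    std::vector<Tile> out;
    out.reserve(tiles.size());
    for (const Tile& t : tiles) out.push_back(step_tile(t, w, h, rule));
    return out;
}
```

The point of the batch function is only that the GPU needs many such independent tiles in flight at once. A handful of tiles would leave most of the device idle.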
I used a GPU implementation running on an NVIDIA GTX 1080 to calculate a lookup table of all ancestors of every 5x5 pattern (2^49 candidate 7x7 patterns).
This takes about 23 hours to complete (18 hours on a dual GTX 980 setup).
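For anyone unfamiliar with the ancestor search, here is a scaled-down brute-force sketch in C++. It is an illustration only, not the GPU code: it counts ancestors rather than storing them, and the names `Rule`, `evolve_center` and `count_ancestors` are hypothetical. The idea is the same as in the 5x5 case: an n x n pattern after one generation is fully determined by the (n+2) x (n+2) pattern around it, so every candidate of that larger size is evolved once and filed under the inner pattern it produces.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical rule description: bit n set means "n live neighbours".
struct Rule {
    std::uint16_t birth;
    std::uint16_t survive;
};

// Evolve the inner x inner centre of an (inner+2) x (inner+2) pattern by one generation.
// Assumed bit layout: cell (x, y) of the outer pattern is bit y*(inner+2)+x,
// cell (x, y) of the result is bit y*inner+x.
std::uint64_t evolve_center(std::uint64_t pattern, int inner, Rule rule) {
    assert(inner >= 1 && inner <= 5);
    const int size = inner + 2;
    auto cell = [&](int x, int y) -> int {
        return static_cast<int>((pattern >> (y * size + x)) & 1u);
    };
    std::uint64_t out = 0;
    for (int y = 1; y <= inner; ++y) {
        for (int x = 1; x <= inner; ++x) {
            int n = 0;  // live neighbour count
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx != 0 || dy != 0) n += cell(x + dx, y + dy);
            const std::uint16_t mask = cell(x, y) ? rule.survive : rule.birth;
            if ((mask >> n) & 1u) out |= std::uint64_t{1} << ((y - 1) * inner + (x - 1));
        }
    }
    return out;
}

// Count, for every inner pattern, how many outer patterns evolve into it.
// Limited to tiny sizes here: inner = 5 means 2^49 candidates, which is GPU territory.
std::vector<std::uint64_t> count_ancestors(int inner, Rule rule) {
    assert(inner >= 1 && inner <= 2);
    const int size = inner + 2;
    const std::uint64_t total = std::uint64_t{1} << (size * size);
    std::vector<std::uint64_t> counts(std::uint64_t{1} << (inner * inner), 0);
    for (std::uint64_t p = 0; p < total; ++p) ++counts[evolve_center(p, inner, rule)];
    return counts;
}
```

With `inner = 1` the loop visits 2^9 = 512 candidates and with `inner = 2` it visits 2^16 = 65536. The 5x5 case follows the same pattern with 2^49 candidates, which is why it needs a GPU and many hours.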
If you can provide an API that allows the data to be processed to be transferred to the GPU, I should be able to provide code for a CUDA DLL that speeds up the tile calculations.
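As a starting point for discussion, here is a rough C++ sketch of the kind of API I have in mind. Everything in it is hypothetical: the function-pointer type `StepTilesFn`, the 8x8 bit-packed tile layout, the `TileBatch` class and the CPU stand-in `step_tiles_cpu` are assumptions of mine, not anything that exists in AVXLife or in my CUDA code. The idea is that the host collects many tiles into one contiguous buffer and hands them over in a single call, so a CUDA DLL exporting a function with the same signature could be dropped in place of the CPU version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry point a CUDA DLL could export.
// Assumed tile layout: 8x8 cells in 64 bits, cell (x, y) is bit y*8+x, outside is dead.
// Rule masks: bit n set means "n live neighbours". Returns 0 on success.
using StepTilesFn = int (*)(const std::uint64_t* in, std::uint64_t* out, std::size_t count,
                            std::uint16_t birth, std::uint16_t survive);

// CPU stand-in with the same signature, useful as a fallback and as a reference.
int step_tiles_cpu(const std::uint64_t* in, std::uint64_t* out, std::size_t count,
                   std::uint16_t birth, std::uint16_t survive) {
    if (count == 0) return 0;
    if (in == nullptr || out == nullptr) return -1;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint64_t tile = in[i];
        std::uint64_t next = 0;
        for (int y = 0; y < 8; ++y) {
            for (int x = 0; x < 8; ++x) {
                int n = 0;  // live neighbour count
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = x + dx, ny = y + dy;
                        if ((dx != 0 || dy != 0) && nx >= 0 && nx < 8 && ny >= 0 && ny < 8)
                            n += static_cast<int>((tile >> (ny * 8 + nx)) & 1u);
                    }
                }
                const bool alive = ((tile >> (y * 8 + x)) & 1u) != 0;
                const std::uint16_t mask = alive ? survive : birth;
                if ((mask >> n) & 1u) next |= std::uint64_t{1} << (y * 8 + x);
            }
        }
        out[i] = next;
    }
    return 0;
}

// Host-side batching: gather tiles, then process them all in one backend call.
class TileBatch {
public:
    explicit TileBatch(StepTilesFn backend) : backend_(backend) { assert(backend != nullptr); }

    void add(std::uint64_t tile) { input_.push_back(tile); }

    // Returns true on success; results() then holds one output tile per input tile.
    bool run(std::uint16_t birth, std::uint16_t survive) {
        output_.assign(input_.size(), 0);
        if (input_.empty()) return true;
        return backend_(input_.data(), output_.data(), input_.size(), birth, survive) == 0;
    }

    const std::vector<std::uint64_t>& results() const { return output_; }

private:
    StepTilesFn backend_;
    std::vector<std::uint64_t> input_;
    std::vector<std::uint64_t> output_;
};
```

A plain function pointer with flat arrays is used on purpose: it maps directly onto a C-style DLL export, and a contiguous input buffer is what a GPU backend would want to copy to the device in one transfer. Whether 8x8 tiles without neighbour information are the right granularity for your code is an open question; the layout would need to match whatever your tile calculations actually use.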