Multi-threaded packet launching for CUDAAccel
When using CUDAAccel we pre-launch the packets on the host and pass them to the CUDA kernel as an argument. Currently this code uses the same SSE code as the software version, BUT it is single-threaded. The launch time is negligible for a small number of simple sources, but it can have a measurable effect when more complicated sources are used. Since this is just a single for-loop, it should be quite simple to parallelize (the loop is in the launch function in TetraMCCUDAKernel.hpp).
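A rough sketch of what splitting that loop across host threads could look like, using plain `std::thread`. Everything named here (`Packet`, `launchOne`, `launchPackets`, the per-index seeding) is a hypothetical stand-in for illustration and not the actual code in TetraMCCUDAKernel.hpp; in particular, the real launch uses the SSE RNG, which would need its own per-thread or per-packet state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for a launched packet; the real type differs.
struct Packet {
    float pos[3];
    float dir[3];
};

// Hypothetical stand-in for launching one packet from a source.
// Assumption: each packet gets its own RNG seeded from its index, so the
// result does not depend on which thread runs it. This is for illustration
// only and is not a statistically vetted seeding scheme.
inline Packet launchOne(std::uint64_t seed, std::size_t i) {
    std::mt19937_64 rng(seed + i);
    std::uniform_real_distribution<float> u(-1.0f, 1.0f);
    Packet p;
    for (float& x : p.pos) x = u(rng);
    for (float& x : p.dir) x = u(rng);
    return p;
}

// Fill the packet buffer using up to numThreads host threads.
// Each thread writes a disjoint, contiguous index range, so no locking
// is needed on the output vector.
inline std::vector<Packet> launchPackets(std::size_t count,
                                         std::uint64_t seed,
                                         unsigned numThreads) {
    std::vector<Packet> packets(count);
    numThreads = std::max(1u, numThreads);
    assert(packets.size() == count);

    // Ceiling division so every index is covered.
    const std::size_t chunk = (count + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;  // no work left for further threads
        workers.emplace_back([&packets, seed, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                packets[i] = launchOne(seed, i);
        });
    }
    for (auto& w : workers) w.join();
    return packets;
}
```

For example, `launchPackets(N, seed, std::thread::hardware_concurrency())` would fill the host buffer before it is copied to the device. Because each packet depends only on its index, the output should be identical for any thread count, which makes the change easy to check against the single-threaded path. An OpenMP `#pragma omp parallel for` over the same loop would be an equally reasonable option if the build already enables OpenMP.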
Edited by Tanner Young-Schultz