Multi-threaded packet launching for CUDAAccel

When using CUDAAccel we pre-launch the packets on the host and pass them to the CUDA kernel as an argument. Currently this code uses the same SSE code as the software version, but it is single-threaded. The launch time is negligible for a small number of simple sources, but it can have a measurable effect when more complicated sources are used. Since the launch is a single for-loop (in the launch function in TetraMCCUDAKernel.hpp), it should be quite simple to parallelize.
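A rough sketch of what the parallel loop could look like, using plain `std::thread` and a static split of the packet range. Everything here is a stand-in for illustration: `LaunchedPacket`, `emitPacket` and `launchPackets` are hypothetical names, not the actual types or functions in TetraMCCUDAKernel.hpp, and the isotropic point source and `std::mt19937` replace the real SSE source and RNG code. The main point is that each thread writes a disjoint slice of the output and owns its own RNG, so no locking is needed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Hypothetical packet layout; the real launched-packet type will differ.
struct LaunchedPacket {
    float pos[3];
    float dir[3];
};

// Hypothetical stand-in for a source's emit step: an isotropic point source
// at the origin. The real sources use the SSE launch code instead.
LaunchedPacket emitPacket(std::mt19937& rng) {
    std::uniform_real_distribution<float> cosDist(-1.0f, 1.0f);
    std::uniform_real_distribution<float> phiDist(0.0f, 6.2831853f);
    const float cosT = cosDist(rng);
    const float sinT = std::sqrt(1.0f - cosT * cosT);
    const float phi = phiDist(rng);
    LaunchedPacket p;
    p.pos[0] = p.pos[1] = p.pos[2] = 0.0f;
    p.dir[0] = sinT * std::cos(phi);
    p.dir[1] = sinT * std::sin(phi);
    p.dir[2] = cosT;
    return p;
}

// Pre-launch nPackets on the host using up to nThreads worker threads.
// Thread t fills the contiguous slice [t*chunk, (t+1)*chunk) of the output.
std::vector<LaunchedPacket> launchPackets(std::size_t nPackets,
                                          unsigned nThreads,
                                          std::uint32_t seed) {
    std::vector<LaunchedPacket> packets(nPackets);
    nThreads = std::max(1u, nThreads);
    const std::size_t chunk = (nPackets + nThreads - 1) / nThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        const std::size_t begin = std::min(nPackets, t * chunk);
        const std::size_t end = std::min(nPackets, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&packets, begin, end, seed, t] {
            // Assumption: seed + t is good enough for a sketch. A real
            // implementation should use properly independent RNG streams.
            std::mt19937 rng(seed + t);
            for (std::size_t i = begin; i < end; ++i)
                packets[i] = emitPacket(rng);
        });
    }
    for (auto& w : workers) w.join();
    return packets;
}
```

Something like `launchPackets(nPackets, std::thread::hardware_concurrency(), seed)` would then replace the serial loop before the buffer is copied to the device (build with `-pthread` on Linux). With a fixed seed and thread count the output is reproducible, but it changes if the thread count changes, which may matter for regression tests. An OpenMP `#pragma omp parallel for` over the same loop would probably work just as well, provided the RNG state is made per-thread.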

Edited Jun 13, 2019 by Tanner Young-Schultz