What are the best CUDA kernel launch parameters?
Repeat the investigations with different kernel launch parameters, and use profiling to gain more insight.
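One way to repeat the investigation systematically is a small timing sweep. The sketch below is only an illustration: `sweep_tpb` and `run_kernel` are hypothetical names, not part of this project, and 'tpb' is assumed to mean threads per block. `run_kernel` stands for any callable that launches the kernel under test with the given tpb and synchronizes the device before returning. Kernel launches are asynchronous, so without that synchronization the timings would only measure launch overhead.

```python
import time


def sweep_tpb(run_kernel, candidates=(32, 64, 128, 256, 512, 1024), repeats=5):
    """Time run_kernel(tpb) for each candidate and return {tpb: best_seconds}.

    run_kernel is a hypothetical callable: it must launch the kernel with the
    given threads-per-block value and block until the device has finished.
    """
    timings = {}
    for tpb in candidates:
        run_kernel(tpb)  # warm-up call, keeps one-time costs out of the timing
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_kernel(tpb)
            best = min(best, time.perf_counter() - start)
        timings[tpb] = best  # keep the fastest of the repeated runs
    return timings
```

Wall-clock timings like these only give a first ranking of the candidates; a profiler is still needed to see why one configuration is faster than another.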
Consider the optional 'tpb' parameter for special kernels, for example when using the parallel helper functions on small arrays. This could be useful for the 'launch_cuda_kernel_x_sum[s]' functions, where the return_array is a small array that holds the reduction values and needs to be reset to 0 for each separate reduction.
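As a rough illustration of how an optional 'tpb' could feed into the launch configuration for small arrays, here is a minimal sketch. It assumes 'tpb' means threads per block. `launch_config`, `default_tpb`, and the default of 256 are hypothetical and are not taken from this project. When no tpb is given, the helper shrinks the block to the smallest multiple of the warp size that covers the array, so a tiny array such as return_array does not get a full-size block.

```python
import math

WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def launch_config(n_items, tpb=None, default_tpb=256):
    """Return (blocks_per_grid, threads_per_block) for a 1D kernel.

    Hypothetical helper: if tpb is None, pick the smallest multiple of the
    warp size that covers n_items, capped at default_tpb.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if tpb is None:
        covering = math.ceil(n_items / WARP_SIZE) * WARP_SIZE
        tpb = min(default_tpb, covering)
    blocks = math.ceil(n_items / tpb)  # enough blocks to cover every item
    return blocks, tpb


# A 10-element array fits in one warp-sized block.
print(launch_config(10))             # (1, 32)
# A larger array falls back to the default block size.
print(launch_config(1000))           # (4, 256)
# An explicit tpb overrides the heuristic.
print(launch_config(1000, tpb=128))  # (8, 128)
```

Whether a smaller block actually helps for the reset of return_array is exactly the kind of question the timing and profiling runs should answer.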