Add dot() operation.
This adds a reasonable but not great strategy for computing the dot
product, based on the existing two-stage accu(). It first performs a
chunked dot product, then accumulates the per-chunk results with a
single thread. Eventually we should implement a generic, efficient
reduce and use that instead of the dot_twostage (and accu_twostage)
kernels, but this works for now and should have reasonable performance.
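
For reference, the two-stage idea can be sketched on the CPU roughly as follows. This is only an illustration and not the actual kernel code: the names dot_chunked and dot_twostage_sketch, the use of double, and the default chunk size are assumptions made for the example. On the device, each chunk would be handled by parallel threads rather than by a serial loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stage 1 (hypothetical helper): compute one partial dot product per chunk.
// In a real kernel the chunks would be processed in parallel; here they are
// processed one after another to keep the sketch simple.
inline std::vector<double> dot_chunked(const std::vector<double>& a,
                                       const std::vector<double>& b,
                                       std::size_t chunk_size)
{
  assert(a.size() == b.size());
  if (chunk_size == 0) { chunk_size = 1; }  // avoid a non-terminating loop

  std::vector<double> partials;
  for (std::size_t start = 0; start < a.size(); start += chunk_size)
  {
    const std::size_t end = std::min(start + chunk_size, a.size());
    double acc = 0.0;
    for (std::size_t i = start; i < end; ++i) { acc += a[i] * b[i]; }
    partials.push_back(acc);
  }
  return partials;
}

// Stage 2 (hypothetical helper): a single thread sums the per-chunk results.
inline double dot_twostage_sketch(const std::vector<double>& a,
                                  const std::vector<double>& b,
                                  std::size_t chunk_size = 256)
{
  const std::vector<double> partials = dot_chunked(a, b, chunk_size);
  double total = 0.0;
  for (const double p : partials) { total += p; }
  return total;
}
```

The second stage is serial, so its cost grows with the number of chunks; that is the part a generic reduce would be expected to improve.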
This fixes #5.