
Add dot() operation.

Ryan Curtin requested to merge rcurtin/bandicoot-code:dot into unstable

This adds a reasonable but not great strategy for computing dot(), based on the existing two-stage accu(). It first performs a chunked dot product, then accumulates the per-chunk results with a single thread. Eventually we should implement a generic, efficient reduce and use that instead of the dot_twostage (and accu_twostage) kernels, but this works for now and should have reasonable performance.
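For readers unfamiliar with the two-stage pattern, here is a minimal CPU-side sketch of the idea in plain C++. It is not the kernel code from this merge request: the names `dot_stage1`, `dot_stage2`, and `dot_twostage_sketch`, and the `chunk_size` parameter, are hypothetical, and the sequential loop over chunks only stands in for what would run in parallel on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stage 1 (hypothetical name): split the inputs into fixed-size chunks and
// compute one partial dot product per chunk. On a GPU each chunk would be
// handled in parallel; here the chunks are simply processed in a loop.
std::vector<double> dot_stage1(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t chunk_size)
{
  assert(a.size() == b.size());
  assert(chunk_size > 0);

  const std::size_t n = a.size();
  const std::size_t n_chunks = (n + chunk_size - 1) / chunk_size; // ceiling division

  std::vector<double> partial(n_chunks, 0.0);
  for (std::size_t c = 0; c < n_chunks; ++c)
  {
    const std::size_t begin = c * chunk_size;
    const std::size_t end = std::min(begin + chunk_size, n); // last chunk may be short

    double acc = 0.0;
    for (std::size_t i = begin; i < end; ++i)
    {
      acc += a[i] * b[i];
    }
    partial[c] = acc;
  }
  return partial;
}

// Stage 2 (hypothetical name): a single thread sums the per-chunk results.
double dot_stage2(const std::vector<double>& partial)
{
  double result = 0.0;
  for (const double p : partial)
  {
    result += p;
  }
  return result;
}

// Convenience wrapper (hypothetical) that runs both stages back to back.
double dot_twostage_sketch(const std::vector<double>& a,
                           const std::vector<double>& b,
                           std::size_t chunk_size = 4)
{
  return dot_stage2(dot_stage1(a, b, chunk_size));
}
```

The single-threaded second stage is the part a generic reduce would replace: with many chunks, that final sum becomes a serial bottleneck, whereas a tree-style reduction could combine the partial results in parallel.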

This fixes #5 (closed).
