Improve accuracy of full tensor reduction for half and bfloat16
We use a tree summation algorithm for full tensor reduction. The relative error in summing n (positive) elements this way is bounded by ~2*eps*(log(n/B) + B), where B is the size of the leaves of the tree, which we sum sequentially in the interest of speed. For less accurate types (i.e. types with larger eps), we reduce B to keep the relative error significantly below 1.
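
To illustrate the trade-off described above, here is a minimal Python sketch, not the actual implementation in this change. It simulates half precision by rounding every intermediate sum to IEEE binary16 with the standard `struct` module. The names `to_half`, `sequential_sum`, `tree_sum` and `error_bound` are hypothetical, the bound assumes a base-2 logarithm (the text does not specify the base), and the leaf sizes used are illustrative rather than the values chosen in this change.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sequential_sum(values):
    """Left-to-right sum, rounding to half precision after every add."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc


def tree_sum(values, leaf_size):
    """Pairwise (tree) sum with sequentially summed leaves of size B."""
    n = len(values)
    if n <= leaf_size:
        return sequential_sum(values)
    mid = n // 2
    left = tree_sum(values[:mid], leaf_size)
    right = tree_sum(values[mid:], leaf_size)
    return to_half(left + right)


def error_bound(n, leaf_size, eps):
    """Approximate bound 2*eps*(log(n/B) + B); log base 2 is an assumption."""
    return 2.0 * eps * (math.log2(n / leaf_size) + leaf_size)


ones = [1.0] * 4096
print(sequential_sum(ones))          # 2048.0: the running sum stops growing
print(tree_sum(ones, leaf_size=16))  # 4096.0: exact for this input

# Illustrative bfloat16-like eps (2**-7) with two hypothetical leaf sizes.
print(error_bound(2**20, 128, 2.0**-7))  # about 2.2, a useless bound
print(error_bound(2**20, 16, 2.0**-7))   # 0.5
```

The sequential sum stalls at 2048 because 2048 + 1 is not representable in binary16 and rounds back to 2048, while the tree sum only ever adds partial sums of similar magnitude. The last two lines show why a large B is problematic for a type with large eps: the B term dominates the bound, so shrinking B brings the bound back below 1 at the cost of more tree levels.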
Edited by Rasmus Munk Larsen