Fix CUDA `max_3` test and some subtle bugs
The `max_3` test in `tests/max.cpp` was known to be failing for the CUDA backend. Today I took a look into it and uncovered some other little issues along the way:
- Adapted `max()` and `min()` to use `generic_reduce()` (just a cleanup).
- Found that `generic_reduce()` for CUDA was only handling one element per thread! Fixed.
- Discovered some very subtle bugs in the max/min CUDA kernels that meant that the second element inspected by every thread would always be ignored.
- Fixing those CUDA kernels required bumping the patch version.
- Fixed a simple compilation bug for the expression `max(max(abs(X)))`.
- Seems like the `accu` benchmark program got modified somewhere along the way, so I reverted it back to what it was.
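
For anyone reviewing the kernel fixes, here is a rough CPU-side sketch of the property the max/min reductions need to satisfy: when a thread is assigned more than one element, every one of them has to be folded into its running result. This is only an illustration, not the actual kernel code; `thread_partial_max()` and `strided_max()` are hypothetical names, and the strided element-to-thread assignment is an assumption about how the work is split.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: emulates the work of one "thread" on the CPU.
// The thread visits elements thread_id, thread_id + num_threads, ...
// (assumed strided layout) and must fold ALL of them into its running max.
float thread_partial_max(const std::vector<float>& data,
                         std::size_t thread_id,
                         std::size_t num_threads)
{
    float best = std::numeric_limits<float>::lowest();
    for (std::size_t i = thread_id; i < data.size(); i += num_threads)
    {
        // Skipping this update for any visited element (for example the
        // second one) silently drops that element from the result.
        best = std::max(best, data[i]);
    }
    return best;
}

// Hypothetical helper: combines the per-thread partial results, standing in
// for the final reduction step across threads.
float strided_max(const std::vector<float>& data, std::size_t num_threads)
{
    float best = std::numeric_limits<float>::lowest();
    for (std::size_t t = 0; t < num_threads; ++t)
    {
        best = std::max(best, thread_partial_max(data, t, num_threads));
    }
    return best;
}
```

A quick way to exercise this kind of bug is to place the maximum at a position that is the second element seen by some thread, e.g. `{1, 2, 9, 3}` with two threads, where thread 0 sees `1` and then `9`.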