Avoid integer overflow in EigenMetaKernel indexing (v2)
This is a re-submission of !681 (merged), which was reverted due to build issues on Windows.
This version differs from the previous one in two ways:
- It doesn't use inline PTX, so there shouldn't be any build issues on Windows.
- It only uses saturated addition in each loop iteration when overflow is possible (i.e., when the size is within total_threads of the max representable index). When overflow is not possible, regular addition is used.
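The selection between saturated and regular addition can be sketched on the host as follows. This is a minimal illustration, not the code in this patch: saturated_add and count_visited are hypothetical names, and the loop simulates a single thread's grid-stride iteration starting at index first.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not taken from the patch): returns a + b, clamped to
// the largest value representable by Index. Assumes a >= 0 and b > 0.
template <typename Index>
Index saturated_add(Index a, Index b) {
  const Index max_index = std::numeric_limits<Index>::max();
  return (a > max_index - b) ? max_index : static_cast<Index>(a + b);
}

// Simulates one thread's grid-stride loop and returns how many elements it
// visits. first is the thread's starting index, total_threads is the stride.
template <typename Index>
Index count_visited(Index first, Index size, Index total_threads) {
  assert(first >= 0 && size >= 0 && total_threads > 0);
  const Index max_index = std::numeric_limits<Index>::max();
  Index visited = 0;
  if (size > max_index - total_threads) {
    // size is within total_threads of the max index, so i + total_threads
    // could overflow: clamp instead. A clamped i is never < size, so the
    // loop terminates.
    for (Index i = first; i < size; i = saturated_add(i, total_threads)) {
      ++visited;
    }
  } else {
    // i < size <= max_index - total_threads, so i + total_threads cannot
    // overflow and regular addition is safe.
    for (Index i = first; i < size; i += total_threads) {
      ++visited;
    }
  }
  return visited;
}
```

The overflow check is done once, outside the loop, so the common case pays nothing extra per iteration.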
Summary of changes:
- The current implementation computes size + total_threads, which can overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation of divup().
- This patch prevents these overflows and allows the kernel to work correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
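For the num_blocks overflow, one way to avoid it is sketched below. It assumes divup() uses the common (x + y - 1) / y formulation, where the intermediate sum x + y - 1 can exceed the index type's range. safe_divup is a hypothetical name and is not necessarily how this patch fixes it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical overflow-free ceiling division (not taken from the patch).
// Avoids forming x + y - 1, which can overflow when x is near the max value.
// Assumes x >= 0 and y > 0.
template <typename Index>
Index safe_divup(Index x, Index y) {
  assert(x >= 0 && y > 0);
  // Truncated quotient, plus one if there is a remainder.
  return static_cast<Index>(x / y + (x % y != 0 ? 1 : 0));
}
```

With a 32-bit signed index, size = 2147483647 and a block size of 1024, the (x + y - 1) / y form would overflow in the numerator, while the quotient-plus-remainder form stays in range.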
cc @nluehr