Fix arm32 float division and related bugs
Reference issue
What does this implement/fix?
ARM32 NEON intrinsics flush denormals to zero. This is problematic for denormal input, and also for very large input whose reciprocal is denormal (among other issues). This patch fixes the following:
- ARM32 has no vectorized float32 division, but it has reciprocal intrinsics. Currently, division is computed as `a / b = a * recip(b)`. If `b` is very large, then `recip(b)` is denormal and is flushed to zero. This patch uses the following procedure: `a / b = f * a * recip(f * b)`, where `f = 0.25`. `f` is only used when `b` is very large, thus maintaining support for very small (normal) values of `b`.
- Increase reciprocal refinement iterations to 2. Currently, there is only 1 refinement step, which is insufficient for many applications (a particularly egregious example is `1.0 / 1.0 != 1.0f`). This fixes several floating point functions that rely on a reasonably accurate `pdiv`.
- ARM32 has no vectorized sqrt, but it has reciprocal sqrt intrinsics. Use these intrinsics instead of the generic implementation, with two refinement steps. Minimize needless error handling while still handling edge cases correctly.
- Change the tests so that ARM32 doesn't attempt computations on denormal numbers (these will always fail), and don't check for correct results if the reference solution is denormal.
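
For readers who want to see the arithmetic, the division and sqrt procedures in the list above can be sketched in portable scalar C++. This is not the code in the patch: `flush_denormal`, `recip_refined`, `div_naive`, `div_scaled`, and `sqrt_via_rsqrt` are hypothetical names, the initial estimates are stand-ins (an exact result perturbed by a relative error of 2^-8) rather than the real NEON estimate intrinsics, and the `2^125` threshold for "very large" is an assumption chosen for illustration. Zero and infinite divisors are not handled in the division sketch.

```cpp
#include <cmath>

// Emulate flush-to-zero: a denormal value becomes a zero of the same sign.
inline float flush_denormal(float x) {
  return (std::fpclassify(x) == FP_SUBNORMAL) ? std::copysign(0.0f, x) : x;
}

// Stand-in for a reciprocal estimate followed by `steps` Newton-Raphson
// refinements. The estimate is an assumption, not the hardware estimate.
inline float recip_refined(float b, int steps) {
  float x = flush_denormal((1.0f / b) * (1.0f + 1.0f / 256.0f));
  for (int i = 0; i < steps; ++i) {
    x = flush_denormal(x * (2.0f - b * x));  // x <- x * (2 - b * x)
  }
  return x;
}

// Division as a * recip(b): fails once recip(b) is flushed to zero.
inline float div_naive(float a, float b) {
  return flush_denormal(a * recip_refined(b, 2));
}

// Division as (f * a) * recip(f * b), with f = 0.25 only for very large b.
inline float div_scaled(float a, float b) {
  const float kLarge = std::ldexp(1.0f, 125);  // assumed threshold
  const float f = (std::fabs(b) >= kLarge) ? 0.25f : 1.0f;
  return flush_denormal((f * a) * recip_refined(f * b, 2));
}

// sqrt(x) = x * rsqrt(x), with two Newton-Raphson steps on rsqrt.
inline float sqrt_via_rsqrt(float x) {
  // Edge cases: 0 and +inf map to themselves; x * rsqrt(x) would give NaN.
  if (x == 0.0f || (std::isinf(x) && x > 0.0f)) return x;
  float y = (1.0f / std::sqrt(x)) * (1.0f + 1.0f / 256.0f);  // stand-in estimate
  for (int i = 0; i < 2; ++i) {
    y = y * (1.5f - 0.5f * x * y * y);  // y <- y * (3 - x * y^2) / 2
  }
  return x * y;
}
```

With `a = b = 3e38f`, `div_naive` returns zero because the reciprocal of `b` is flushed, while `div_scaled` returns a value close to one. Calling `recip_refined(1.0f, 1)` and `recip_refined(1.0f, 2)` shows why the second refinement step matters: in this sketch the one-step result is still visibly off from 1.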
Fixes the following tests in cross CI testing:
- 35 - packetmath_1 (Child aborted)
- 49 - packetmath_15 (Child aborted)
- 247 - array_cwise_11 (Child aborted)
- 249 - array_cwise_12 (Child aborted)
- 251 - array_cwise_14 (Child aborted)
- 253 - array_cwise_16 (Child aborted)
- 258 - array_cwise_21 (Child aborted)
- 449 - qr_colpivoting_1 (Child aborted)
- 493 - eigensolver_selfadjoint_3 (Child aborted)
- 550 - jacobisvd_26 (Child aborted)
- 551 - jacobisvd_27 (Child aborted)
- 606 - bdcsvd_27 (Child aborted)
- 607 - bdcsvd_28 (Child aborted)
- 643 - geo_quaternion_1 (Child aborted)
https://gitlab.com/libeigen/eigen_ci_cross_testing/-/pipelines/944835667
Also, I got rid of the sparse permutation test that counted the number of allocations for P * alpha * M. This only fails on arm32. I figure the test is bad, but I really have no idea why.
Additional information
Edited by Charles Schlosser