Consolidate float and double implementations of patan().
The new implementations share the same range reduction to [-1, 1] and use separate rational approximations only for x in [-1, 1].
Results differ by less than 3 ULPs from std::atan.
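For readers unfamiliar with the approach, here is a minimal scalar sketch of a shared range reduction with a type-generic kernel, plus a helper for measuring ULP distance. It is not the code in this change: `atan_sketch`, `atan_kernel_sketch`, and `ulp_distance` are hypothetical names, the reduction identity atan(x) = sign(x) * pi/2 - atan(1/x) is an assumption about how the reduction to [-1, 1] could be done, and the kernel is a low-order Pade-style stand-in that is nowhere near 3 ULPs accurate.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in kernel for |x| <= 1: the rational function
// (15x + 4x^3) / (15 + 9x^2), which matches the atan series through x^5.
// It is only good to about 1e-2 near |x| = 1; the real per-type rational
// approximations are NOT reproduced here.
template <typename T>
T atan_kernel_sketch(T x) {
  const T x2 = x * x;
  return x * (T(15) + T(4) * x2) / (T(15) + T(9) * x2);
}

// Shared range reduction (assumed identity): for |x| > 1,
//   atan(x) = sign(x) * pi/2 - atan(1/x),
// so the kernel only ever sees arguments in [-1, 1].
template <typename T>
T atan_sketch(T x) {
  const T half_pi = T(1.5707963267948966);
  if (std::abs(x) <= T(1)) return atan_kernel_sketch(x);
  return std::copysign(half_pi, x) - atan_kernel_sketch(T(1) / x);
}

// Distance between two finite floats in units in the last place, computed
// by mapping their bit patterns onto a monotonically ordered integer line.
inline std::int64_t ulp_distance(float a, float b) {
  assert(std::isfinite(a) && std::isfinite(b));
  auto ordered = [](float f) {
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    // Negative floats have the sign bit set; fold them below zero.
    return bits < 0 ? std::int64_t(INT32_MIN) - bits : std::int64_t(bits);
  };
  const std::int64_t d = ordered(a) - ordered(b);
  return d < 0 ? -d : d;
}
```

One way to check an accuracy claim like the one above would be to sweep a range of float inputs and verify `ulp_distance(candidate(x), std::atan(x)) < 3` for each; the stand-in kernel here would fail that check, since it exists only to show the structure.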
The consolidation gives a speedup for some combinations of type and ISA:
| ISA | Type | Speedup |
|---|---|---|
| SSE 4.2 | float | 30% |
| AVX2+FMA | float | 25% |
| AVX512 | float | 0% |
| SSE 4.2 | double | 3.5% |
| AVX2+FMA | double | 20% |
| AVX512 | double | -3% |
Full benchmark results are here.
Edited by Rasmus Munk Larsen