Consolidate float and double implementations of patan().
The new implementations share the same range reduction to [-1, 1] and differ only in the rational approximation used for x in [-1, 1].
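
As an illustration of what a range reduction to [-1, 1] can look like, here is a minimal scalar sketch. It assumes the common identity atan(x) = sign(x)·π/2 − atan(1/x) for |x| > 1, which may or may not be the reduction this change uses. The names `atan_core` and `atan_reduced` are hypothetical, and `atan_core` simply defers to `std::atan` in place of the type-specific rational approximation.

```cpp
#include <cmath>

// Hypothetical stand-in for the per-type rational approximation on [-1, 1].
// It defers to std::atan so the sketch stays short; it is NOT the
// approximation used in this change.
template <typename T>
T atan_core(T x) {
  return std::atan(x);
}

// Sketch of a shared range reduction (assumed identity, see lead-in):
//   |x| <= 1 : atan(x) is evaluated directly.
//   |x| >  1 : atan(x) = sign(x) * pi/2 - atan(1/x), and 1/x lies in [-1, 1].
template <typename T>
T atan_reduced(T x) {
  const T half_pi = T(1.5707963267948966);
  // The negated comparison lets NaN fall through to atan_core unchanged.
  if (!(std::abs(x) > T(1))) {
    return atan_core(x);
  }
  return std::copysign(half_pi, x) - atan_core(T(1) / x);
}
```

With this structure only `atan_core` would need a separate float and double version; the reduction itself is written once.
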
Results differ from std::atan by less than 3 ULPs.
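
For readers unfamiliar with the metric, the sketch below shows one common way to count the ULP distance between two finite floats by comparing their bit patterns. It is not the measurement code behind the figure above, and the helper names `ordered_bits` and `ulp_distance` are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a finite float to an integer whose ordering matches the float ordering.
// IEEE 754 floats are sign-magnitude, so negative values are mirrored.
// +0.0f and -0.0f both map to 0.
inline std::int64_t ordered_bits(float x) {
  std::int32_t i;
  std::memcpy(&i, &x, sizeof(i));
  return i < 0 ? -static_cast<std::int64_t>(i & 0x7fffffff)
               : static_cast<std::int64_t>(i);
}

// Number of representable floats between a and b (0 if they are equal).
// Intended for finite inputs only; NaN inputs give meaningless results.
inline std::int64_t ulp_distance(float a, float b) {
  const std::int64_t d = ordered_bits(a) - ordered_bits(b);
  return d < 0 ? -d : d;
}
```

A check of the accuracy claim could then compare `ulp_distance(candidate, std::atan(x))` against 3 over a sweep of inputs, where `candidate` is the vectorized result for the same `x`.
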
This gives a speedup for some combinations of type and ISA:

| ISA | Type | Speedup |
|---|---|---|
| SSE 4.2 | float | 30% |
| AVX2+FMA | float | 25% |
| AVX512 | float | 0% |
| SSE 4.2 | double | 3.5% |
| AVX2+FMA | double | 20% |
| AVX512 | double | -3% |

Full benchmark results are here.
Edited by Rasmus Munk Larsen