Simpler range reduction strategy for atan<float>().
Reference issue
What does this implement/fix?
Additional information
This change removes one division and some pselect logic, at the cost of a couple of extra FMAs. The relative error remains <= 2 ulps, and the speedup is 20-40% on x86. $2421160
Unfortunately, the same change is not viable for double without going to a very high polynomial degree, negating the benefit.
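For readers unfamiliar with this kind of range reduction, here is a minimal scalar sketch. It assumes the reduction is the reciprocal identity atan(|x|) = pi/2 - atan(1/|x|) for |x| > 1, so the kernel only ever sees arguments in [0, 1]. The names atan_reduced_sketch and atan_sketch are hypothetical, and the kernel is a std::atan placeholder rather than the polynomial used in this change; the real code operates on packets and selects with pselect instead of branching.

```cpp
#include <cmath>

// Hypothetical stand-in for the reduced-range kernel, valid for 0 <= r <= 1.
// A real implementation would evaluate an odd polynomial with FMAs here;
// std::atan is only a placeholder, NOT the polynomial from this change.
inline float atan_reduced_sketch(float r) { return std::atan(r); }

// Scalar illustration of the range reduction (assumed form, see lead-in).
inline float atan_sketch(float x) {
  const float kPiOver2 = 1.57079632679489661923f;
  const float ax = std::fabs(x);
  const bool large = ax > 1.0f;
  // "Large" inputs are mapped into [0, 1] with a single reciprocal.
  const float r = large ? 1.0f / ax : ax;
  const float p = atan_reduced_sketch(r);
  // Undo the reduction: atan(|x|) = pi/2 - atan(1/|x|) for |x| > 1.
  const float result = large ? kPiOver2 - p : p;
  // atan is odd, so restore the sign of the input (also preserves -0.0f).
  return std::copysign(result, x);
}
```

With this layout only one select on the input and one on the output are needed, which is the kind of simplification the description refers to.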
This change also refactors the inner polynomial approximations for atan<float>() and atan<double>() into separate functions, so that they can be reused later in a more efficient implementation of atan2().
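To illustrate why a standalone reduced-range kernel is useful for atan2(), the sketch below shows one possible way to build atan2 on top of such a kernel. This is an assumption for illustration only, not the planned implementation: atan_kernel_sketch and atan2_sketch are hypothetical names, the kernel is again a std::atan placeholder, and infinities and NaNs are not handled.

```cpp
#include <cmath>

// Hypothetical stand-in for a shared reduced-range kernel on [0, 1].
// Placeholder only; a real kernel would be a polynomial approximation.
inline float atan_kernel_sketch(float r) { return std::atan(r); }

// Illustrative atan2 for finite inputs, built on the shared kernel.
inline float atan2_sketch(float y, float x) {
  const float kPi = 3.14159265358979323846f;
  const float kPiOver2 = 1.57079632679489661923f;
  const float ax = std::fabs(x);
  const float ay = std::fabs(y);
  const float hi = std::fmax(ax, ay);
  const float lo = std::fmin(ax, ay);
  // The ratio min/max always lies in [0, 1]; use 0 when both inputs are
  // zero to avoid computing 0/0.
  const float r = hi > 0.0f ? lo / hi : 0.0f;
  float a = atan_kernel_sketch(r);
  // If |y| > |x| the kernel returned the angle measured from the y axis.
  if (ay > ax) a = kPiOver2 - a;
  // Reflect into the left half-plane; signbit also catches x == -0.0f.
  if (std::signbit(x)) a = kPi - a;
  // The sign of the result follows the sign of y.
  return std::copysign(a, y);
}
```

Because the ratio is formed directly from y and x, a single division feeds the kernel, whereas calling atan(y / x) would require a second division inside atan's own range reduction.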
Edited by Rasmus Munk Larsen