Optimizations are:
    1. Reduce code size (-58 bytes).
    2. Remove redundant move instructions.
    3. Slightly improve instruction selection/scheduling where possible.
    4. Reduce rodata size ([-128, -188] bytes).

Result is roughly a 14% reduction in run time:
Function, New Time, Old Time, New / Old
_ZGVeN16v_atanhf, 11.95, 13.879, 0.861
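
For reference, the New / Old figure can be rechecked from the two timings in the table. The sketch below is only illustrative arithmetic; the variable names are made up here, and the time units are whatever the benchmark reported.

```python
# Timings copied from the table (units as reported by the benchmark).
new_time = 11.95
old_time = 13.879

# New / Old column: values below 1.0 mean the new code is faster.
ratio = new_time / old_time

# Fraction of the old run time that was saved.
time_reduction = 1.0 - ratio

# Same result expressed as a gain in calls per unit time.
throughput_gain = old_time / new_time - 1.0

print(f"New / Old       = {ratio:.3f}")
print(f"time reduction  = {time_reduction:.1%}")
print(f"throughput gain = {throughput_gain:.1%}")
```

The time reduction (1 - New / Old) and the throughput gain (Old / New - 1) are different measures of the same change, so it helps to say which one a quoted percentage refers to.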