x86: Optimize svml_s_atanhf16_core_avx512.S

Optimizations are:
1. Reduce code size (-58 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
   possible.
4. Reduce rodata size ([-128, -188] bytes).
Result is roughly a 14% reduction in run time (New / Old = 0.861):
        Function, New Time, Old Time, New / Old
_ZGVeN16v_atanhf,    11.95,   13.879,     0.861
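
As a sanity check, the New / Old column and the quoted percentage can be recomputed from the two timings in the table. The sketch below is illustrative only: the variable names are hypothetical, the timings are copied from the table, and no assumption is made about their units.

```python
# Timings copied from the benchmark table (units as reported by the benchmark).
new_time = 11.95
old_time = 13.879

# "New / Old" column: values below 1.0 mean the new code is faster.
ratio = new_time / old_time

# Fraction of run time saved per call.
time_reduction = 1.0 - ratio

# The same change expressed as a throughput gain (calls per unit time).
throughput_gain = old_time / new_time - 1.0

print(f"New / Old       = {ratio:.3f}")
print(f"time reduction  = {time_reduction:.1%}")
print(f"throughput gain = {throughput_gain:.1%}")
```

The two percentages differ because one is measured against the old time and the other against the new time; the figure quoted in this message corresponds to the time reduction.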