Fix the bug when using the NEON instruction fmla with data type half
Reference issue
What does this implement/fix?
For data type half, the register used as operand 3 of fmla (the by-element form) must be in v0~v15. The inline assembly that was introduced to avoid the GCC problem that vfmaq_lane_f16 is implemented through a costly dup therefore can't be used here as is. However, when compiling with GCC, using the intrinsic leads to performance degradation, so I add a restriction here.
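The register limit described above can be illustrated with a minimal sketch. It is not the code of this patch: the helper names `fma_lane0` and `fma_lane0_demo` are hypothetical, and the sketch assumes that the restriction is expressed with GCC's AArch64 `x` constraint (like `w`, but limited to registers 0 to 15). On targets without NEON half-precision arithmetic it falls back to scalar float so that it still compiles.

```cpp
#if defined(__aarch64__) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
#include <arm_neon.h>
#define SKETCH_HAS_NEON_FP16 1
#else
#define SKETCH_HAS_NEON_FP16 0
#endif

#if SKETCH_HAS_NEON_FP16
// Hypothetical helper: c += a * b[0] on eight half-precision lanes.
inline float16x8_t fma_lane0(float16x8_t c, float16x8_t a, float16x4_t b) {
#if defined(__GNUC__) && !defined(__clang__)
  // GCC path: keep the inline assembly, but use "x" instead of "w" for b so
  // that the indexed operand is allocated in v0~v15 (assumed form of the
  // restriction, see the paragraph above).
  asm("fmla %0.8h, %1.8h, %2.h[0]" : "+w"(c) : "w"(a), "x"(b));
  return c;
#else
  // Other compilers: the intrinsic lets the compiler pick a legal register.
  return vfmaq_lane_f16(c, a, b, 0);
#endif
}
#endif

// Hypothetical demo: computes 1 + 2 * 3 and returns lane 0 as float.
inline float fma_lane0_demo() {
#if SKETCH_HAS_NEON_FP16
  float16x8_t c = vdupq_n_f16(static_cast<float16_t>(1.0f));
  float16x8_t a = vdupq_n_f16(static_cast<float16_t>(2.0f));
  float16x4_t b = vdup_n_f16(static_cast<float16_t>(3.0f));
  return static_cast<float>(vgetq_lane_f16(fma_lane0(c, a, b), 0));
#else
  // Scalar fallback so the sketch builds on targets without NEON fp16.
  return 1.0f + 2.0f * 3.0f;
#endif
}
```

With a plain `w` constraint the compiler is free to place `b` in v16~v31, which the assembler rejects for `.h` elements. Whether this happens depends on register pressure.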
Additional information
It is only by coincidence that this bug is not triggered when EIGEN_NEON_GEBP_NR=8. If EIGEN_NEON_GEBP_NR is set to 4, GCC reports the following error:

Edited by Lianhuang Li