Fix incorrect NEON native fp16 multiplication.
TensorFlow's tensor contractions currently fail on ARM hardware with native fp16 support due to a bug in the specialized kernel implementation.
All the NEON GEBP specializations are a bit hacky, in that they replace the
RHS packet with a single scalar, then use special instructions to
perform a Packet += Packet * Scalar. In the case of native __fp16,
where we have a Packet8h, this broke the assumption that we can split
the left-hand packet into groups of 4 elements and multiply them by an
RHS loaded via ploadquad. The hack works for floats, since the packet size
is 4, so ploadquad fills the packet with a single value, which we can
mimic with multiplication by a single scalar. However, that assumption
breaks down when the packet size is 8.
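
To make the mismatch concrete, here is a small stand-alone sketch with
plain float arrays standing in for SIMD packets (the helper names are
hypothetical, not Eigen's API): the ploadquad-style RHS and the
single-scalar hack produce the same fused multiply-add for a 4-element
packet, but diverge once the packet holds 8 elements.

```cpp
#include <cstdio>

// ploadquad loads Size/4 values and repeats each of them 4 times. The NEON
// GEBP hack instead multiplies the whole packet by one scalar. The two only
// agree when Size == 4, i.e. when ploadquad repeats a single value.

template <int Size>
void madd_quad(const float* lhs, const float* rhs, float (&acc)[Size]) {
  for (int i = 0; i < Size; ++i) acc[i] += lhs[i] * rhs[i / 4];  // ploadquad-style RHS
}

template <int Size>
void madd_scalar(const float* lhs, float rhs0, float (&acc)[Size]) {
  for (int i = 0; i < Size; ++i) acc[i] += lhs[i] * rhs0;        // scalar-hack RHS
}

int main() {
  const float lhs[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  const float rhs[2] = {10, 100};

  // 4-wide packet (float on NEON): both paths give {10, 20, 30, 40}.
  float a4[4] = {0}, b4[4] = {0};
  madd_quad(lhs, rhs, a4);
  madd_scalar(lhs, rhs[0], b4);

  // 8-wide packet (native __fp16 on NEON): the quad path uses rhs[1] for the
  // upper half, but the scalar hack does not, so the results diverge.
  float a8[8] = {0}, b8[8] = {0};
  madd_quad(lhs, rhs, a8);
  madd_scalar(lhs, rhs[0], b8);

  for (int i = 0; i < 8; ++i) printf("%g vs %g\n", a8[i], b8[i]);
  return 0;
}
```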
Put in a fallback in the general GEBP kernel to avoid ploadquad when it
is not feasible, and add an assertion to the NEON __fp16
specialization.
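
As a rough illustration of that shape of fix (the identifiers and the
feasibility predicate below are guesses for exposition, not Eigen's
actual code), the general path can dispatch at compile time to a
non-ploadquad fallback, while the scalar-shortcut path asserts the
packet-width assumption it depends on:

```cpp
#include <cstdio>

// Plain arrays stand in for SIMD packets; names are illustrative only.

// General kernel: dispatch at compile time and fall back to a broadcast load
// whenever the ploadquad-style load is not feasible for this packet type.
template <int Size, bool QuadLoadFeasible>
void load_rhs(const float* rhs, float (&packet)[Size]) {
  if constexpr (QuadLoadFeasible) {
    for (int i = 0; i < Size; ++i) packet[i] = rhs[i / 4];  // ploadquad-style
  } else {
    for (int i = 0; i < Size; ++i) packet[i] = rhs[0];      // fallback: broadcast
  }
}

// Scalar-shortcut specialization: assert the assumption it relies on, so an
// incompatible packet width fails loudly instead of silently producing a
// wrong product.
template <int Size>
float load_rhs_scalar(const float* rhs) {
  static_assert(Size == 4, "scalar RHS shortcut assumes a 4-element packet");
  return rhs[0];
}

int main() {
  const float rhs[2] = {10, 100};

  float quad[8], bcast[8];
  load_rhs<8, true>(rhs, quad);    // {10,10,10,10,100,100,100,100}
  load_rhs<8, false>(rhs, bcast);  // {10,10,10,10,10,10,10,10}

  float s = load_rhs_scalar<4>(rhs);   // fine for a 4-wide packet
  // load_rhs_scalar<8>(rhs);          // would trip the static_assert

  printf("%g %g %g\n", quad[4], bcast[4], s);
  return 0;
}
```

The sketch uses a static_assert; the actual Eigen assertion may be a
runtime check instead, but the intent is the same either way: a violated
packet-width assumption should fail loudly rather than yield incorrect
products.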