
Fix incorrect NEON native fp16 multiplication.

TensorFlow's tensor contractions currently fail on ARM hardware with native fp16 support due to a bug in the specialized kernel implementation.

All the NEON GEBP specializations are a bit hacky, in that they replace the RHS packet with a single scalar and then use special instructions to perform a fused Packet += Packet * Scalar. In the case of native __fp16, where we have a Packet8h, this broke the assumption that we can split the left-hand packet into groups of 4 elements and multiply by an RHS loaded via ploadquad. The hack works for floats, since the packet size is 4, so ploadquad fills the packet with a single value, which we can mimic with a multiplication by a single scalar. However, the assumption breaks down when the packet size is 8.
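For illustration, here is a minimal plain-C++ sketch of ploadquad's semantics (the `Packet` alias and the scalar loop are hypothetical stand-ins, not Eigen's actual packet types) showing why the single-scalar trick only mimics it when the packet size is 4:

```cpp
// Illustrative sketch only, not Eigen's implementation. The Packet alias and
// this scalar ploadquad are hypothetical stand-ins for Eigen's packet types.
#include <array>
#include <cstdio>

template <typename T, int N>
using Packet = std::array<T, N>;  // hypothetical N-lane SIMD packet

// ploadquad semantics: lane i receives rhs[i / 4], i.e. each group of 4
// lanes is filled from the next RHS scalar.
template <typename T, int N>
Packet<T, N> ploadquad(const T* rhs) {
  Packet<T, N> p{};
  for (int i = 0; i < N; ++i) p[i] = rhs[i / 4];
  return p;
}

int main() {
  float rhs[2] = {1.5f, 2.5f};

  // 4-lane packet (e.g. float on NEON): a single broadcast value, so
  // multiplying by the scalar rhs[0] matches multiplying by the packet.
  auto p4 = ploadquad<float, 4>(rhs);  // {1.5, 1.5, 1.5, 1.5}

  // 8-lane packet (e.g. native __fp16 on NEON): two distinct values, which
  // a single-scalar multiply cannot reproduce.
  auto p8 = ploadquad<float, 8>(rhs);  // {1.5 x4, 2.5 x4}

  for (float v : p4) std::printf("%g ", v);
  std::printf("\n");
  for (float v : p8) std::printf("%g ", v);
  std::printf("\n");
}
```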

Added a fallback in the general GEBP kernel to avoid ploadquad when it is not feasible, and an assertion to the NEON __fp16 specialization.
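A rough sketch, using hypothetical names rather than Eigen's actual internals, of what such a fallback dispatch could look like (the assertion added to the specialization is not shown):

```cpp
// Rough sketch of the fallback idea with hypothetical names; this is not
// the actual Eigen patch. A 4-lane packet keeps the scalar-RHS trick, while
// wider packets fall back to the true ploadquad indexing (rhs[i / 4]).
#include <array>
#include <cstdio>

template <typename T, int N>
using Packet = std::array<T, N>;  // hypothetical N-lane SIMD packet

// acc += lhs * RHS, where the RHS follows ploadquad semantics.
template <typename T, int N>
void gebp_madd(Packet<T, N>& acc, const Packet<T, N>& lhs, const T* rhs) {
  if constexpr (N == 4) {
    // 4 lanes: ploadquad broadcasts one value, so the existing trick of
    // multiplying by a single scalar is valid.
    for (int i = 0; i < N; ++i) acc[i] += lhs[i] * rhs[0];
  } else {
    // Wider packets: avoid the scalar trick and use lane-dependent RHS
    // values, matching what ploadquad actually loads.
    for (int i = 0; i < N; ++i) acc[i] += lhs[i] * rhs[i / 4];
  }
}

int main() {
  Packet<float, 8> acc{};
  Packet<float, 8> lhs{1, 1, 1, 1, 1, 1, 1, 1};
  float rhs[2] = {1.5f, 2.5f};
  gebp_madd(acc, lhs, rhs);                    // takes the fallback branch
  for (float v : acc) std::printf("%g ", v);   // prints 1.5 x4 then 2.5 x4
  std::printf("\n");
}
```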
