Improve float matrix multiplication performance on ARM NEON (take 2)
## Submitted by Benoit Jacob Assigned to **Nobody** **[Link to original bugzilla bug (#1633)](https://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633)** **Platform**: ARM - NEON ## Description This is a follow-up from bug #1624. *** Summary so far: *** 1. We know what a fast float GEMM kernel looks like on ARM NEON: it should take advantage of multiply-accumulate-against-single-element instructions, like this: https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc#L4670-L4716 2. The patches in bug #1624 implemented that basic idea of taking advantage of multiply-by-element instructions. However, they didn't take advantage of the ability to multiply by an *arbitrary* element, they only used multiplication by the 0-th element in a vector, with repeated loading of new data into that 0-th element. Finally, that was submitted as a4760548793811ee1accf8de05ff791a43d54be5. 3. The patches in bug #1624 also had a crash issue, loading 8 bytes when only 4 may be present. 4. The crash issue was fixed in e01823ce7f6e, however that turned out to regress performance. Looking at disassembly, the reason why this regresses performance is that at least Clang compiles vmlaq_n_f32 as a ld1r instruction (load scalar and duplicate onto all lanes) followed by an ordinary multiply-add instruction, not taking advantage of multiply-by-element; and it seems that ld1r does not dual-issue so well with multiply-add. *** New patch *** This new patch: - Implements exactly the original idea of fast kernels on NEON, with a 128-bit load loading 4 RHS float values, each of them used in-place by a multiply-add-by-element instruction. - Offers higher performance overall than even the fast code from above step 2 (a4760548793811ee1accf8de05ff791a43d54be5). - Does not read data out of bounds, unlike the fast code from above step 2 (a4760548793811ee1accf8de05ff791a43d54be5). ### Blocking #1642
issue