Improve float matrix multiplication performance on ARM NEON (take 2)
Submitted by Benoit Jacob
Assigned to Nobody
Link to original bugzilla bug (#1633)
Platform: ARM NEON
Description
This is a followup from bug #1624 (closed).
*** Summary so far: ***

We know what a fast float GEMM kernel looks like on ARM NEON: it should take advantage of multiply-accumulate-against-single-element instructions, like this:
https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc#L4670-L4716
The patches in bug #1624 (closed) implemented that basic idea of taking advantage of multiply-by-element instructions. However, they did not take advantage of the ability to multiply by an arbitrary element: they only used multiplication by the 0th element of a vector, repeatedly loading new data into that 0th element. That was eventually submitted as a4760548.
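For illustration, here is a minimal scalar C sketch of the multiply-by-element idea; the function name and the 4x4 tile size are assumptions for this example, not the actual Eigen kernel. On NEON, each `j` iteration of the inner update would map to one multiply-add-by-element instruction (e.g. `vfmaq_laneq_f32` on AArch64), reusing the RHS register in place instead of reloading lane 0:

```c
#include <assert.h>

/* Hypothetical 4x4 micro-kernel update: acc += lhs_col * rhs_row.
 * `lhs` holds one column of 4 LHS floats (one vector register);
 * `rhs` holds 4 RHS floats brought in by a single 128-bit load.
 * On NEON, each j-iteration is one multiply-add-by-element,
 * selecting lane j of the rhs register directly. */
static void kernel_4x4_step(float acc[4][4], const float lhs[4], const float rhs[4]) {
    for (int j = 0; j < 4; ++j)       /* one RHS lane per output column */
        for (int i = 0; i < 4; ++i)   /* vector of 4 accumulators */
            acc[i][j] += lhs[i] * rhs[j];
}
```

The point of the lane-indexed form is that the four multiply-adds consume the same loaded RHS register, so no per-element reload (or `ld1r` broadcast) sits between them.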

The patches in bug #1624 (closed) also had a crash issue: they could load 8 bytes where only 4 were valid.
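One common way to avoid such an over-read, sketched here under the assumption of a zero-padded staging buffer (this is the general technique, not necessarily what the Eigen patch does), is to copy only the valid remainder before issuing the full-width load:

```c
#include <string.h>

/* Hypothetical residual-load helper: when fewer than 4 floats remain,
 * copy the `avail` valid floats into a zero-padded staging buffer so
 * that a full 128-bit (4-float) vector load never reads memory past
 * the end of the source. */
static void load_padded_4(float dst[4], const float *src, unsigned avail) {
    memset(dst, 0, 4 * sizeof(float));               /* zero the padding lanes */
    memcpy(dst, src, (avail < 4 ? avail : 4) * sizeof(float));
}
```

The zeroed lanes make the subsequent multiply-adds contribute nothing, so correctness is preserved at a small copy cost on the matrix edges only.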

The crash issue was fixed in e01823ce7f6e, but that fix turned out to regress performance. The disassembly shows why: Clang, at least, compiles vmlaq_n_f32 as a ld1r instruction (load a scalar and duplicate it onto all lanes) followed by an ordinary multiply-add instruction, not taking advantage of multiply-by-element; and ld1r does not dual-issue well with multiply-add.
*** New patch ***
This new patch:
- Implements exactly the original idea of fast kernels on NEON: a single 128-bit load reads 4 RHS float values, each of which is then used in place by a multiply-add-by-element instruction.
- Offers higher overall performance than even the fast code from step 2 above (a4760548).
- Does not read data out of bounds, unlike the fast code from step 2 above (a4760548).