Improve float matrix multiplication performance on ARM NEON (take 2)
Submitted by Benoit Jacob
Assigned to Nobody
Link to original bugzilla bug (#1633)
Platform: ARM - NEON
Description
This is a follow-up from bug #1624 (closed).
*** Summary so far: ***
- We know what a fast float GEMM kernel looks like on ARM NEON: it should take advantage of multiply-accumulate-against-a-single-element instructions, as in this gemmlowp kernel:
  https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc#L4670-L4716
- The patches in bug #1624 (closed) implemented the basic idea of taking advantage of multiply-by-element instructions, but they did not exploit the ability to multiply by an arbitrary element: they only used multiplication by the 0th element of a vector, repeatedly loading new data into that 0th element. That version was eventually submitted as a4760548.
- The patches in bug #1624 (closed) also had a crash issue: they could load 8 bytes when only 4 were guaranteed to be present.
- The crash issue was fixed in e01823ce7f6e, but that fix turned out to regress performance. Disassembly shows why: Clang, at least, compiles vmlaq_n_f32 into a ld1r instruction (load a scalar and duplicate it onto all lanes) followed by an ordinary multiply-add, so the multiply-by-element form is never emitted, and ld1r does not dual-issue well with multiply-add. The sketch after this list illustrates the two forms.
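To make the difference concrete, here is a minimal sketch of the two intrinsic forms, not the actual Eigen code. It assumes AArch64 and <arm_neon.h>; the function names are illustrative only.

```cpp
#include <arm_neon.h>

// Broadcast form: acc += lhs * (*rhs). Per the disassembly observation above,
// Clang lowers vmlaq_n_f32 to a ld1r (load scalar, duplicate onto all lanes)
// followed by a plain multiply-add, and the ld1r dual-issues poorly.
float32x4_t mla_broadcast(float32x4_t acc, float32x4_t lhs, const float* rhs) {
  return vmlaq_n_f32(acc, lhs, *rhs);
}

// By-element form: one 128-bit load brings in 4 RHS floats, and the
// multiply-add reads its scalar straight from a register lane
// (FMLA by element), with no per-scalar reload.
float32x4_t mla_by_element(float32x4_t acc, float32x4_t lhs, const float* rhs) {
  float32x4_t r = vld1q_f32(rhs);          // 4 RHS floats in a single load
  return vfmaq_laneq_f32(acc, lhs, r, 0);  // acc += lhs * r[0], in-register
}
```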
*** New patch ***
This new patch:
- Implements exactly the original idea of fast NEON kernels: a single 128-bit load fetches 4 RHS float values, and each of them is used in place by a multiply-add-by-element instruction (see the sketch after this list).
- Offers higher performance overall than even the earlier fast code (a4760548) from the summary above.
- Does not read data out of bounds, unlike that earlier fast code (a4760548).
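As an illustration of that structure (a sketch under assumptions, not the patch itself): one depth step of a 4x4 micro-kernel written this way might look as follows. It assumes AArch64 and <arm_neon.h>; the function name, argument layout, and the convention of one accumulator vector per output column are all hypothetical.

```cpp
#include <arm_neon.h>

// One depth step of a 4x4 micro-kernel:
// acc (4x4 output block, one vector per column) += lhs_col (4x1) * rhs_row (1x4).
static inline void kernel_4x4_step(float32x4_t acc[4],
                                   const float* lhs_col,
                                   const float* rhs_row) {
  float32x4_t l = vld1q_f32(lhs_col);  // 4 LHS floats (one column)
  float32x4_t r = vld1q_f32(rhs_row);  // single 128-bit load: 4 RHS floats
  // Each RHS lane is consumed in place by a multiply-add-by-element
  // instruction (FMLA by element); nothing is reloaded or broadcast.
  acc[0] = vfmaq_laneq_f32(acc[0], l, r, 0);
  acc[1] = vfmaq_laneq_f32(acc[1], l, r, 1);
  acc[2] = vfmaq_laneq_f32(acc[2], l, r, 2);
  acc[3] = vfmaq_laneq_f32(acc[3], l, r, 3);
}
```

Note that the single vld1q_f32 of the RHS presumes at least 4 valid floats at rhs_row; presumably the patch's packing or edge handling is what guarantees this without the out-of-bounds reads of the earlier fast code.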