Improve float matrix multiplication performance on ARM NEON (take 2)
Submitted by Benoit Jacob
Assigned to Nobody
Link to original bugzilla bug (#1633)
Platform: ARM - NEON
Description
This is a follow-up from bug #1624 (closed).
*** Summary so far: ***
- We know what a fast float GEMM kernel looks like on ARM NEON: it should take advantage of multiply-accumulate-against-a-single-element instructions, as in this gemmlowp kernel:
  https://github.com/google/gemmlowp/blob/3fb5c176c17c765a3492cd2f0321b0dab712f350/standalone/neon-gemm-kernel-benchmark.cc#L4670-L4716
- The patches in bug #1624 (closed) implemented the basic idea of taking advantage of multiply-by-element instructions, but they did not exploit the ability to multiply by an arbitrary element: they only used multiplication by the 0th element of a vector, repeatedly loading new data into that 0th element. That version was eventually submitted as a4760548.
- The patches in bug #1624 (closed) also had a crash issue: they could load 8 bytes when only 4 were guaranteed to be present.
- The crash issue was fixed in e01823ce7f6e, but that fix turned out to regress performance. Disassembly shows why: Clang, at least, compiles vmlaq_n_f32 into a ld1r instruction (load a scalar and duplicate it onto all lanes) followed by an ordinary multiply-add, so the multiply-by-element form is never emitted, and ld1r does not dual-issue well with multiply-add. The sketch after this list illustrates the two forms.
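To make the difference concrete, here is a minimal sketch of the two intrinsic forms, not the actual Eigen code. It assumes AArch64 and <arm_neon.h>; the function names are illustrative only.

```cpp
#include <arm_neon.h>

// Broadcast form: acc += lhs * (*rhs). Per the disassembly observation above,
// Clang lowers vmlaq_n_f32 to a ld1r (load scalar, duplicate onto all lanes)
// followed by a plain multiply-add, and the ld1r dual-issues poorly.
float32x4_t mla_broadcast(float32x4_t acc, float32x4_t lhs, const float* rhs) {
  return vmlaq_n_f32(acc, lhs, *rhs);
}

// By-element form: one 128-bit load brings in 4 RHS floats, and the
// multiply-add reads its scalar straight from a register lane
// (FMLA by element), with no per-scalar reload.
float32x4_t mla_by_element(float32x4_t acc, float32x4_t lhs, const float* rhs) {
  float32x4_t r = vld1q_f32(rhs);          // 4 RHS floats in a single load
  return vfmaq_laneq_f32(acc, lhs, r, 0);  // acc += lhs * r[0], in-register
}
```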
*** New patch ***
This new patch:
- Implements exactly the original idea of fast NEON kernels: a single 128-bit load fetches 4 RHS float values, and each of them is used in place by a multiply-add-by-element instruction (see the sketch after this list).
- Offers higher performance overall than even the earlier fast code (a4760548) from the summary above.
- Does not read data out of bounds, unlike that earlier fast code (a4760548).
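As an illustration of that structure (a sketch under assumptions, not the patch itself): one depth step of a 4x4 micro-kernel written this way might look as follows. It assumes AArch64 and <arm_neon.h>; the function name, argument layout, and the convention of one accumulator vector per output column are all hypothetical.

```cpp
#include <arm_neon.h>

// One depth step of a 4x4 micro-kernel:
// acc (4x4 output block, one vector per column) += lhs_col (4x1) * rhs_row (1x4).
static inline void kernel_4x4_step(float32x4_t acc[4],
                                   const float* lhs_col,
                                   const float* rhs_row) {
  float32x4_t l = vld1q_f32(lhs_col);  // 4 LHS floats (one column)
  float32x4_t r = vld1q_f32(rhs_row);  // single 128-bit load: 4 RHS floats
  // Each RHS lane is consumed in place by a multiply-add-by-element
  // instruction (FMLA by element); nothing is reloaded or broadcast.
  acc[0] = vfmaq_laneq_f32(acc[0], l, r, 0);
  acc[1] = vfmaq_laneq_f32(acc[1], l, r, 1);
  acc[2] = vfmaq_laneq_f32(acc[2], l, r, 2);
  acc[3] = vfmaq_laneq_f32(acc[3], l, r, 3);
}
```

Note that the single vld1q_f32 of the RHS presumes at least 4 valid floats at rhs_row; presumably the patch's packing or edge handling is what guarantees this without the out-of-bounds reads of the earlier fast code.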