Use 3px8/2px8/1px8/1x8 gebp_kernel on arm64-neon
Reference issue
#2518 (closed)
What does this implement/fix?
By default, the gebp_kernel that Eigen uses with NEON is 3px4/2px4/1px4. This is reasonable on x86 (AVX/FMA) and arm32. However, arm64 NEON has 32 vector registers, so a larger gebp_kernel can be used to get better data reuse and improve performance.
Therefore, I implemented the 3px8/2px8/1px8 gebp_kernel in Eigen (3px8 uses 24 registers for the accumulators, 3 for lhs, and 1 for rhs, i.e. 28 of the 32 available).
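The register arithmetic above can be checked with a small sketch. This is a simplified model and not Eigen code: `KernelShape`, `registers_needed`, `fits`, `fmas_per_step`, and `loads_per_step` are hypothetical names introduced only for illustration. It assumes one accumulator register per (lhs packet, rhs column) pair, one register per lhs packet, and a single register for the rhs operand, as in the 3px8 breakdown given here. The load count additionally assumes one load per lhs packet and one per rhs coefficient for each step along the depth dimension, which may differ from what the real kernel does.

```cpp
// Simplified model of a gebp micro-kernel tile. All names are illustrative
// and are not Eigen identifiers.
struct KernelShape {
  int lhs_packets;  // the "3p" in 3px8: SIMD packets of lhs rows
  int rhs_cols;     // the "8" in 3px8: rhs columns handled per tile
};

// Assumption: one accumulator per (lhs packet, rhs column) pair,
// one register per lhs packet, and one register for the rhs operand.
constexpr int registers_needed(KernelShape s) {
  return s.lhs_packets * s.rhs_cols + s.lhs_packets + 1;
}

// True if the tile fits in a register file of the given size.
constexpr bool fits(KernelShape s, int register_file_size) {
  return registers_needed(s) <= register_file_size;
}

// Fused multiply-adds issued per step along the depth dimension.
constexpr int fmas_per_step(KernelShape s) {
  return s.lhs_packets * s.rhs_cols;
}

// Assumption: one load per lhs packet plus one per rhs coefficient per step.
constexpr int loads_per_step(KernelShape s) {
  return s.lhs_packets + s.rhs_cols;
}

// 3px4 needs 12 + 3 + 1 = 16 registers, so it fits a 16-register file.
static_assert(registers_needed(KernelShape{3, 4}) == 16, "3px4 budget");
static_assert(fits(KernelShape{3, 4}, 16), "3px4 fits in 16 registers");

// 3px8 needs 24 + 3 + 1 = 28 registers: too many for 16, fine for 32.
static_assert(registers_needed(KernelShape{3, 8}) == 28, "3px8 budget");
static_assert(!fits(KernelShape{3, 8}, 16), "3px8 does not fit in 16");
static_assert(fits(KernelShape{3, 8}, 32), "3px8 fits in 32 registers");
```

Under this model, the 3px4 tile issues 12 FMAs for 7 loads per step, while the 3px8 tile issues 24 FMAs for 11 loads, which is one way to read the "better data reuse" argument.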
Additional information
Benchmark
clang: dgemm, sgemm, hgemm (plots not preserved)
gcc: dgemm, sgemm, hgemm (plots not preserved)
Platform: Ampere® Altra
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: r3p1
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 8 MiB (128 instances)
L1i: 8 MiB (128 instances)
L2: 128 MiB (128 instances)











