NBNxM CPU kernel loop unroll issue

Some compilers, in particular the LLVM backend for ARM, do not unroll some short, fixed-length loops in the innermost loop of the NBNxM CPU SIMD kernels. These loops were introduced in commit 0ad4a2ed in 2022, so the performance of release-2023 and release-2024 is affected.
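To illustrate the kind of loop in question, here is a minimal sketch. It is not the kernel code and not necessarily the fix chosen here. The constant `c_numClusters` and the functions `sumPlainLoop` and `sumUnrolled` are hypothetical names made up for this example. The first function leaves unrolling to the optimizer. The second expands the loop body at compile time with a C++17 fold expression, so it does not depend on the compiler's unrolling heuristics.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical trip count: stands in for a small compile-time constant
// such as a cluster size. Not taken from the actual kernels.
constexpr std::size_t c_numClusters = 4;

// Plain fixed-length loop: whether it gets unrolled is up to the optimizer.
inline float sumPlainLoop(const std::array<float, c_numClusters>& v)
{
    float sum = 0.0F;
    for (std::size_t i = 0; i < c_numClusters; i++)
    {
        sum += v[i] * v[i];
    }
    return sum;
}

// Compile-time expansion: the left fold below becomes
// ((v[0]*v[0] + v[1]*v[1]) + v[2]*v[2]) + v[3]*v[3], with no loop left to unroll.
template<std::size_t... I>
inline float sumUnrolledImpl(const std::array<float, c_numClusters>& v, std::index_sequence<I...>)
{
    return (... + (v[I] * v[I]));
}

inline float sumUnrolled(const std::array<float, c_numClusters>& v)
{
    return sumUnrolledImpl(v, std::make_index_sequence<c_numClusters>{});
}
```

Both functions compute the same result. Comparing the generated assembly of the two on an affected compiler is one way to check whether the plain loop was unrolled.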

This is part of #4752, where a lot of performance results are discussed.