Gemv microoptimization
Reference issue
What does this implement/fix?
Explicitly defining the loop bounds for the unrolled stages that increment by PacketSize fixes the aggressive loop optimization compiler warnings. I learned this trick to minimize the overhead of rounding down to the nearest multiple of a power of two. Dividing and then multiplying by a compile-time power of two entails a right shift followed by a left shift. This can be further optimized to a single bitwise AND.
Normally the compiler applies this optimization automatically, but only if the type is an unsigned integer. Index is a signed integer, and signed division rounds toward zero, so the compiler plays it safe and emits extra instructions to handle negative values. Our indices are always non-negative, so we can skip this check.
https://godbolt.org/z/a6drKb6W8
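A minimal sketch of the two forms described above, not the code from this merge request. The `Index` alias and the function names `round_down_div_mul` and `round_down_mask` are hypothetical, chosen for illustration; the sketch assumes `PacketSize` is a compile-time power of two and that the input is non-negative.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: a signed index type, standing in for the library's Index.
using Index = std::ptrdiff_t;

// Straightforward form: divide, then multiply. With a signed type the
// compiler must also handle negative n, because signed division rounds
// toward zero rather than toward negative infinity.
template <Index PacketSize>
Index round_down_div_mul(Index n) {
  static_assert(PacketSize > 0 && (PacketSize & (PacketSize - 1)) == 0,
                "PacketSize must be a power of two");
  return (n / PacketSize) * PacketSize;
}

// Masked form: clear the low bits with a single bitwise AND.
// It matches the form above only for n >= 0, which is the stated precondition.
template <Index PacketSize>
Index round_down_mask(Index n) {
  static_assert(PacketSize > 0 && (PacketSize & (PacketSize - 1)) == 0,
                "PacketSize must be a power of two");
  assert(n >= 0 && "indices are assumed to be non-negative");
  return n & ~(PacketSize - 1);
}
```

For example, with a packet size of 4 both functions map 13 to 12, so a loop that steps by `PacketSize` can use the masked value as its explicit upper bound.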
I wanted to address this fix before cherry-picking it to 3.4.
Additional information
Edited by Charles Schlosser