Products where RHS is narrow perform better with non-default blocking sizes
Submitted by Benoit Jacob
Assigned to Nobody
Link to original bugzilla bug (#938)
Description
Consider, for example, MatrixXf products of size (256 x 256) times (256 x 16), i.e. with a narrow RHS.
The attachment in bug #937 comment 3 shows that on a Nexus 4, the default blocking parameter kc=256 gives only 3.8 GFlop/s, while the lower value kc=128 gives 9.2 GFlop/s, more than 2x faster!
On a Core i7, the attachment in bug #937 comment 1 shows that the default blocking parameter kc=256 gives 61 GFlop/s, while the lower value kc=128 gives 68.5 GFlop/s for sufficiently small mc, and kc=64 even gives 69.5 GFlop/s for all values of mc!
The Core i7 results also include L1 cache-miss counts, which show that the performance degradation coincides with an increase in the L1 cache read-miss count. In the listing below, block = (kc, mc, nc):
size (256,256,16), block (64,32,16), l1_misses (r: 7.47e+03, w: 3.04e+03), time=3.02e-05s, achieved 69.5 GFlop/s
size (256,256,16), block (64,64,16), l1_misses (r: 7.5e+03, w: 3.03e+03), time=3.1e-05s, achieved 67.7 GFlop/s
size (256,256,16), block (64,128,16), l1_misses (r: 7.47e+03, w: 3.04e+03), time=3.02e-05s, achieved 69.5 GFlop/s
size (256,256,16), block (64,256,16), l1_misses (r: 7.47e+03, w: 3.04e+03), time=3.02e-05s, achieved 69.5 GFlop/s
size (256,256,16), block (128,16,16), l1_misses (r: 7.59e+03, w: 2.88e+03), time=3.06e-05s, achieved 68.5 GFlop/s
size (256,256,16), block (128,32,16), l1_misses (r: 7.59e+03, w: 2.88e+03), time=3.06e-05s, achieved 68.5 GFlop/s
size (256,256,16), block (128,64,16), l1_misses (r: 9.61e+03, w: 2.88e+03), time=3.25e-05s, achieved 64.6 GFlop/s
size (256,256,16), block (128,128,16), l1_misses (r: 9.61e+03, w: 2.88e+03), time=3.24e-05s, achieved 64.7 GFlop/s
size (256,256,16), block (128,256,16), l1_misses (r: 9.6e+03, w: 2.88e+03), time=3.24e-05s, achieved 64.6 GFlop/s
size (256,256,16), block (256,16,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.42e-05s, achieved 61.3 GFlop/s
size (256,256,16), block (256,32,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.43e-05s, achieved 61.1 GFlop/s
size (256,256,16), block (256,64,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.42e-05s, achieved 61.2 GFlop/s
size (256,256,16), block (256,128,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.42e-05s, achieved 61.4 GFlop/s
size (256,256,16), block (256,256,16), l1_misses (r: 1.09e+04, w: 2.7e+03), time=3.43e-05s, achieved 61.2 GFlop/s
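For reference, the three blocking parameters enter the standard panel-block GEMM loop nest as sketched below. This is a minimal generic sketch, not Eigen's actual implementation (which packs the panels into contiguous buffers and runs a vectorized micro-kernel on them):

```python
# Minimal sketch of the classic three-level blocked GEMM loop nest,
# showing where the kc/mc/nc parameters from the measurements above
# enter the computation. Matrices are row-major flat lists of floats.
def blocked_gemm(A, B, m, n, k, mc, nc, kc):
    """Return C = A*B with A (m x k) and B (k x n)."""
    C = [0.0] * (m * n)
    for kk in range(0, k, kc):           # kc: depth of the packed panels
        kb = min(kc, k - kk)
        for jj in range(0, n, nc):       # nc: width of the RHS panel (kc x nc)
            nb = min(nc, n - jj)
            for ii in range(0, m, mc):   # mc: height of the LHS block (mc x kc)
                mb = min(mc, m - ii)
                # "micro-kernel": one (mb x kb) * (kb x nb) block product
                for i in range(ii, ii + mb):
                    for j in range(jj, jj + nb):
                        acc = C[i * n + j]
                        for p in range(kk, kk + kb):
                            acc += A[i * k + p] * B[p * n + j]
                        C[i * n + j] = acc
    return C
```

The point of the structure is that one kc x nc RHS panel is reused across the entire mc loop, so it is the buffer that most wants to stay resident in L1 while LHS blocks stream past it.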
Thus we see three classes of cases:
kc == 64, or kc == 128 with mc <= 32 -> ~7.5e3 L1 cache read misses, 68.5-69.5 GFlop/s
kc == 128 with mc >= 64 -> 9610 L1 cache read misses, ~64.6 GFlop/s
kc == 256 -> 10900 L1 cache read misses, ~61.2 GFlop/s
Given that this CPU has an L1 data cache of 32 KB, can you make sense of these results?
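A back-of-envelope footprint calculation is at least consistent with the classes above. This is my own arithmetic, not from the measurements; mr = 8 is an assumed register-block height, and exactly which buffers Eigen keeps hot in L1 is a guess:

```python
# Approximate L1 working set of the micro-kernel for the single-precision
# runs above. Assumption (not from the report): the packed kc x nc RHS
# panel plus one mr x kc LHS micro-panel should stay hot in L1; mr = 8 is
# a guessed register-block height.
FLOAT = 4                 # sizeof(float)
L1 = 32 * 1024            # 32 KB L1 data cache on this Core i7
NC, MR = 16, 8            # nc = 16 from the runs above; mr is assumed

for kc in (64, 128, 256):
    rhs_panel = kc * NC * FLOAT   # reused across the whole mc loop
    lhs_panel = kc * MR * FLOAT   # streamed through per micro-kernel call
    frac = (rhs_panel + lhs_panel) / L1
    print(f"kc={kc:3}: rhs={rhs_panel:6} B, lhs={lhs_panel:5} B, "
          f"{frac:.0%} of L1")
```

At kc = 256 the RHS panel alone is 16 KB, half of L1, and panel plus micro-panel already fill 75% of it, leaving little headroom for the C tile and other traffic; at kc <= 128 everything fits comfortably. That would explain why the read-miss count, and with it the GFlop/s, degrades as kc grows.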