libeigen / eigen · Issues · #938

Closed

Issue created Dec 04, 2019 by Eigen Bugzilla @eigenbz (Reporter)

Products where RHS is narrow perform better with non-default blocking sizes

Submitted by Benoit Jacob

Assigned to Nobody

Link to original bugzilla bug (#938)

Description

For example, consider MatrixXf products of size (256 x 256) times (256 x 16), so that the RHS is narrow.
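For reference, a minimal self-contained reproduction of the shape being discussed (variable names are mine, not from the report):

```cpp
#include <Eigen/Dense>

int main() {
    // The case discussed here: a square LHS times a narrow RHS,
    // i.e. a (256 x 256) * (256 x 16) single-precision product.
    Eigen::MatrixXf lhs = Eigen::MatrixXf::Random(256, 256);
    Eigen::MatrixXf rhs = Eigen::MatrixXf::Random(256, 16);
    Eigen::MatrixXf prod = lhs * rhs;  // dispatches to Eigen's blocked GEMM path
    return prod.sum() > 0.f ? 0 : 1;   // use the result so it isn't optimized away
}
```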

The attachment in bug #937 comment 3 shows that on a Nexus 4, the default blocking parameter kc=256 gives only 3.8 GFlop/s, while the lower value kc=128 gives 9.2 GFlop/s, more than 2x faster!

On a Core i7, the attachment in bug #937 comment 1 shows that the default blocking parameter kc=256 gives 61 GFlop/s, while the lower value kc=128 gives 68.5 GFlop/s for sufficiently small mc, and kc=64 even gives 69.5 GFlop/s for all values of mc!
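As an aside, one way to rerun such kc experiments from user code, without patching Eigen, is to shrink the cache sizes its blocking heuristic sees. A sketch assuming the Eigen 3.3+ `Eigen::setCpuCacheSizes` / `Eigen::l1CacheSize` API; how the heuristic maps these sizes to kc is an internal detail, so the 16 KiB value below is illustrative and does not pin kc to a specific value:

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Report a smaller L1 to Eigen's blocking heuristic, which in turn
    // lowers the kc it picks (the exact mapping is an internal detail).
    std::cout << "default L1 size seen by Eigen: " << Eigen::l1CacheSize() << "\n";
    Eigen::setCpuCacheSizes(/*l1=*/16 * 1024, Eigen::l2CacheSize(), Eigen::l3CacheSize());

    Eigen::MatrixXf lhs = Eigen::MatrixXf::Random(256, 256);
    Eigen::MatrixXf rhs = Eigen::MatrixXf::Random(256, 16);
    Eigen::MatrixXf prod = lhs * rhs;  // now runs with the adjusted blocking
    return prod.sum() > 0.f ? 0 : 1;
}
```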

The Core i7 results also include L1 cache miss counts, which show that the performance degradation comes with an increase in the L1 cache read-miss count.

All rows use size (m,k,n) = (256,256,16); the block triple reads as (kc, mc, nc):

block (kc,mc,nc)   l1_misses (read)   l1_misses (write)   time         achieved
(64,  32,  16)     7.47e+03           3.04e+03            3.02e-05 s   69.5 GFlop/s
(64,  64,  16)     7.5e+03            3.03e+03            3.1e-05 s    67.7 GFlop/s
(64,  128, 16)     7.47e+03           3.04e+03            3.02e-05 s   69.5 GFlop/s
(64,  256, 16)     7.47e+03           3.04e+03            3.02e-05 s   69.5 GFlop/s
(128, 16,  16)     7.59e+03           2.88e+03            3.06e-05 s   68.5 GFlop/s
(128, 32,  16)     7.59e+03           2.88e+03            3.06e-05 s   68.5 GFlop/s
(128, 64,  16)     9.61e+03           2.88e+03            3.25e-05 s   64.6 GFlop/s
(128, 128, 16)     9.61e+03           2.88e+03            3.24e-05 s   64.7 GFlop/s
(128, 256, 16)     9.6e+03            2.88e+03            3.24e-05 s   64.6 GFlop/s
(256, 16,  16)     1.09e+04           2.7e+03             3.42e-05 s   61.3 GFlop/s
(256, 32,  16)     1.09e+04           2.7e+03             3.43e-05 s   61.1 GFlop/s
(256, 64,  16)     1.09e+04           2.7e+03             3.42e-05 s   61.2 GFlop/s
(256, 128, 16)     1.09e+04           2.7e+03             3.42e-05 s   61.4 GFlop/s
(256, 256, 16)     1.09e+04           2.7e+03             3.43e-05 s   61.2 GFlop/s

Thus we see three classes of cases:

kc == 64, or kc == 128 with mc <= 32 -> ~7,500-7,590 L1 cache read misses, 68.5-69.5 GFlop/s
kc == 128 with mc >= 64 -> ~9,610 L1 cache read misses, ~64.6 GFlop/s
kc == 256 -> ~10,900 L1 cache read misses, ~61 GFlop/s

Given that this CPU has an L1 data cache of 32 KiB, can you make sense of these results?
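One way to start making sense of them (my reading, not from the original report): during the sweep over mc, the packed kc x nc RHS block is meant to stay resident in L1 alongside an mr x kc LHS micro-panel. With nc = 16 and float entries, kc = 256 makes that RHS block alone 16 KiB, half of the 32 KiB L1, leaving little headroom for the LHS panels and the destination, while kc = 64 or 128 leaves ample room. A sketch of the arithmetic, with mr = 8 assumed for a float SSE kernel:

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // Rough L1 footprint of what the GEBP kernel wants resident:
    // the packed kc x nc RHS block plus one mr x kc LHS micro-panel.
    // nc = 16 matches the narrow RHS here; mr = 8 is an assumed
    // register-block height for a float SSE kernel, not a measured value.
    const int nc = 16, mr = 8, bytes = 4;  // sizeof(float)
    const int l1 = 32 * 1024;              // L1 data cache size quoted above
    for (int kc : {64, 128, 256}) {
        int rhs_block = kc * nc * bytes;   // packed RHS block, reused across mc
        int lhs_panel = mr * kc * bytes;   // one packed LHS micro-panel
        std::printf("kc=%3d: RHS %5d B + LHS panel %4d B = %5d B of %d B L1\n",
                    kc, rhs_block, lhs_panel, rhs_block + lhs_panel, l1);
    }
    return 0;
}
```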

Blocking: #937

Edited Dec 05, 2019 by Eigen Bugzilla