Add a yield instruction in the two spinloops of the threaded matmul implementation.
Adding a std::this_thread::yield() hint in the two spinloops of the threaded matmul implementation yields fairly significant savings in the number of instructions issued. It has no noticeable effect on the wall time of the operation, but it saves power and may yield the CPU to do useful work on other tasks.
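For context, the change follows the standard spin-wait pattern with a scheduler hint. A minimal sketch of the idea (not the actual Eigen spinloop; the done flag and SpinWait name are illustrative only):

#include <atomic>
#include <thread>

std::atomic<bool> done{false};

// Spin until another worker signals completion. The yield() hint lets the
// OS scheduler run other runnable threads instead of this one burning
// instructions in a tight load loop; this matters most when the machine is
// oversubscribed.
void SpinWait() {
  while (!done.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
}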
Benchmark code:
#include <benchmark/benchmark.h>

#include <Eigen/Core>
#include <Eigen/ThreadPool>  // or unsupported/Eigen/CXX11/ThreadPool, depending on Eigen version

template<typename T>
void BM_MatMulScaling(benchmark::State& state) {
  using Matrix =
      Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
  int num_threads = state.range(0);
  constexpr int n = 4096;
  Matrix a(n, n), b(n, n), c(n, n);
  Eigen::ThreadPool thread_pool(num_threads);
  Eigen::setGemmThreadPool(&thread_pool);
  a.setRandom();
  b.setRandom();
  c.setZero();
  for (auto s : state) {
    c.noalias() += a * b;
  }
}

BENCHMARK(BM_MatMulScaling<float>)
    ->Arg(1)->Arg(2)->Arg(4)->Arg(6)->Arg(8)->Arg(12)->Arg(16)->Arg(32)->Arg(36)->Arg(72);
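To build this as a standalone binary, the usual Google Benchmark entry point can be added (assuming the benchmark is not linked against a separate main library):

BENCHMARK_MAIN();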
Measurements taken on an Intel(R) Xeon(R) Gold 6154 (Skylake-X), compiled with clang (roughly at HEAD) using -march=haswell.
name old cpu/op new cpu/op delta
BM_MatMulScaling<float>/1 1.67s ± 2% 1.69s ± 1% ~ (p=0.151 n=5+5)
BM_MatMulScaling<float>/2 1.70s ± 7% 1.70s ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/4 1.73s ± 4% 1.73s ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/6 1.81s ±11% 1.75s ± 2% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/8 1.76s ± 3% 1.75s ± 4% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/12 1.83s ± 4% 1.82s ± 1% ~ (p=0.905 n=5+4)
BM_MatMulScaling<float>/16 2.08s ±42% 2.00s ±32% ~ (p=1.000 n=5+5)
BM_MatMulScaling<float>/32 3.23s ± 5% 3.31s ± 5% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/36 3.69s ±32% 3.74s ±35% ~ (p=0.841 n=5+5)
BM_MatMulScaling<float>/72 9.40s ±60% 7.95s ±14% ~ (p=0.690 n=5+5)
name old time/op new time/op delta
BM_MatMulScaling<float>/1 1.67s ± 2% 1.69s ± 1% ~ (p=0.222 n=5+5)
BM_MatMulScaling<float>/2 850ms ± 7% 852ms ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/4 433ms ± 4% 434ms ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/6 302ms ±11% 293ms ± 2% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/8 221ms ± 3% 220ms ± 4% ~ (p=0.690 n=5+5)
BM_MatMulScaling<float>/12 153ms ± 4% 152ms ± 1% ~ (p=0.905 n=5+4)
BM_MatMulScaling<float>/16 130ms ±42% 126ms ±32% ~ (p=1.000 n=5+5)
BM_MatMulScaling<float>/32 102ms ± 5% 104ms ± 5% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/36 104ms ±31% 105ms ±35% ~ (p=0.841 n=5+5)
BM_MatMulScaling<float>/72 136ms ±59% 123ms ±17% ~ (p=1.000 n=5+5)
name old INSTRUCTIONS/op new INSTRUCTIONS/op delta
BM_MatMulScaling<float>/1 15.1G ± 0% 15.1G ± 0% ~ (p=0.690 n=5+5)
BM_MatMulScaling<float>/2 15.5G ± 7% 15.1G ± 0% ~ (p=0.151 n=5+5)
BM_MatMulScaling<float>/4 15.4G ± 2% 15.1G ± 0% -1.56% (p=0.008 n=5+5)
BM_MatMulScaling<float>/6 16.1G ±13% 15.1G ± 0% -6.01% (p=0.008 n=5+5)
BM_MatMulScaling<float>/8 15.4G ± 2% 15.1G ± 0% ~ (p=0.056 n=5+5)
BM_MatMulScaling<float>/12 16.0G ± 3% 15.1G ± 0% -5.30% (p=0.008 n=5+5)
BM_MatMulScaling<float>/16 15.9G ± 4% 15.1G ± 0% -4.51% (p=0.029 n=4+4)
BM_MatMulScaling<float>/32 18.9G ±16% 15.1G ± 0% -20.06% (p=0.016 n=5+4)
BM_MatMulScaling<float>/36 21.9G ±34% 15.2G ± 0% -30.89% (p=0.008 n=5+5)
BM_MatMulScaling<float>/72 56.1G ±73% 15.2G ± 0% -72.93% (p=0.008 n=5+5)
name old CYCLES/op new CYCLES/op delta
BM_MatMulScaling<float>/1 5.70G ± 1% 5.70G ± 0% ~ (p=0.421 n=5+5)
BM_MatMulScaling<float>/2 5.83G ± 8% 5.69G ± 1% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/4 5.72G ± 4% 5.66G ± 2% ~ (p=0.841 n=5+5)
BM_MatMulScaling<float>/6 5.97G ±12% 5.66G ± 1% -5.17% (p=0.008 n=5+5)
BM_MatMulScaling<float>/8 5.79G ± 3% 5.69G ± 1% ~ (p=0.056 n=5+5)
BM_MatMulScaling<float>/12 6.01G ± 5% 5.76G ± 1% -4.16% (p=0.008 n=5+5)
BM_MatMulScaling<float>/16 6.89G ±46% 5.87G ± 6% ~ (p=0.151 n=5+5)
BM_MatMulScaling<float>/32 10.1G ± 4% 9.3G ± 1% -8.11% (p=0.016 n=5+4)
BM_MatMulScaling<float>/36 10.4G ± 3% 9.1G ±11% -12.31% (p=0.016 n=4+5)
BM_MatMulScaling<float>/72 30.7G ±67% 11.0G ± 1% -64.07% (p=0.008 n=5+5)
Flame graph before (for 74 threads, so strongly oversubscribed):
Flame graph after (for 74 threads, so strongly oversubscribed):
Output of perf stat -d -d -d -v -- $command with 72 threads. Notice that while elapsed time remains roughly the same (within measurement noise), cycles, instructions, dcache loads, and dTLB loads decrease dramatically when yield is added:
With yield:
Performance counter stats for './blaze-bin/experimental/users/rmlarsen/bench/matmul_bench --benchmark_filter=BM_MatMulScaling<float>/72':
7,298.94 msec task-clock:u # 35.380 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,768 page-faults:u # 242.227 /sec
8,977,569,476 cycles:u # 1.230 GHz (38.99%)
12,988,913,246 instructions:u # 1.45 insn per cycle (46.70%)
111,602,532 branches:u # 15.290 M/sec (44.59%)
2,242,969 branch-misses:u # 2.01% of all branches (42.43%)
5,812,400,949 L1-dcache-loads:u # 796.335 M/sec (40.29%)
1,294,947,000 L1-dcache-load-misses:u # 22.28% of all L1-dcache accesses (37.84%)
14,235,824 LLC-loads:u # 1.950 M/sec (28.99%)
7,036,400 LLC-load-misses:u # 49.43% of all LL-cache accesses (30.08%)
<not supported> L1-icache-loads:u
1,328,642 L1-icache-load-misses:u (31.87%)
5,486,017,833 dTLB-loads:u # 751.618 M/sec (33.94%)
17,543 dTLB-load-misses:u # 0.00% of all dTLB cache accesses (35.49%)
45,339 iTLB-loads:u # 6.212 K/sec (35.98%)
38,021 iTLB-load-misses:u # 83.86% of all iTLB cache accesses (38.04%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
0.206303517 seconds time elapsed
4.425312000 seconds user
2.870901000 seconds sys
Without yield:
Performance counter stats for './blaze-bin/experimental/users/rmlarsen/bench/matmul_bench --benchmark_filter=BM_MatMulScaling<float>/72':
6,576.06 msec task-clock:u # 34.423 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,721 page-faults:u # 261.707 /sec
20,070,865,986 cycles:u # 3.052 GHz (39.22%)
35,503,400,176 instructions:u # 1.77 insn per cycle (46.77%)
5,537,475,000 branches:u # 842.066 M/sec (44.33%)
1,041,366 branch-misses:u # 0.02% of all branches (41.10%)
16,859,609,447 L1-dcache-loads:u # 2.564 G/sec (37.15%)
1,230,169,930 L1-dcache-load-misses:u # 7.30% of all L1-dcache accesses (32.78%)
14,661,464 LLC-loads:u # 2.230 M/sec (22.13%)
5,860,608 LLC-load-misses:u # 39.97% of all LL-cache accesses (27.21%)
<not supported> L1-icache-loads:u
611,297 L1-icache-load-misses:u (35.65%)
11,596,468,243 dTLB-loads:u # 1.763 G/sec (43.08%)
6,635 dTLB-load-misses:u # 0.00% of all dTLB cache accesses (44.94%)
2,030 iTLB-loads:u # 308.696 /sec (43.34%)
709 iTLB-load-misses:u # 34.93% of all iTLB cache accesses (41.15%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
0.191036989 seconds time elapsed
6.476650000 seconds user
0.095596000 seconds sys