Add a yield instruction in the two spinloops of the threaded matmul implementation.
Adding a std::this_thread::yield() hint in the two spinloops of the threaded matmul implementation yields fairly significant savings in the number of instructions issued. It has no noticeable effect on the wall time of the operation, but it saves power and may yield the CPU to do useful work on other tasks.
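For context, the change follows the standard spin-wait pattern with a scheduler hint. A minimal sketch of the idea (not the actual Eigen spinloop; the done flag and SpinWait name are illustrative only):

#include <atomic>
#include <thread>

std::atomic<bool> done{false};

// Spin until another worker signals completion. The yield() hint lets the
// OS scheduler run other runnable threads instead of this one burning
// instructions in a tight load loop; this matters most when the machine is
// oversubscribed.
void SpinWait() {
  while (!done.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
}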
Benchmark code:
#include <benchmark/benchmark.h>

#include <Eigen/Core>
#include <Eigen/ThreadPool>  // or unsupported/Eigen/CXX11/ThreadPool, depending on Eigen version

template<typename T>
void BM_MatMulScaling(benchmark::State& state) {
  using Matrix =
      Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
  int num_threads = state.range(0);
  constexpr int n = 4096;
  Matrix a(n, n), b(n, n), c(n, n);
  Eigen::ThreadPool thread_pool(num_threads);
  Eigen::setGemmThreadPool(&thread_pool);
  a.setRandom();
  b.setRandom();
  c.setZero();
  for (auto s : state) {
    c.noalias() += a * b;
  }
}

BENCHMARK(BM_MatMulScaling<float>)
    ->Arg(1)->Arg(2)->Arg(4)->Arg(6)->Arg(8)->Arg(12)->Arg(16)->Arg(32)->Arg(36)->Arg(72);
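To build this as a standalone binary, the usual Google Benchmark entry point can be added (assuming the benchmark is not linked against a separate main library):

BENCHMARK_MAIN();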
Measurements taken on an Intel(R) Xeon(R) Gold 6154 (Skylake-X), compiled with clang (roughly at HEAD) using -march=haswell.
name old cpu/op new cpu/op delta
BM_MatMulScaling<float>/1 1.67s ± 2% 1.69s ± 1% ~ (p=0.151 n=5+5)
BM_MatMulScaling<float>/2 1.70s ± 7% 1.70s ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/4 1.73s ± 4% 1.73s ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/6 1.81s ±11% 1.75s ± 2% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/8 1.76s ± 3% 1.75s ± 4% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/12 1.83s ± 4% 1.82s ± 1% ~ (p=0.905 n=5+4)
BM_MatMulScaling<float>/16 2.08s ±42% 2.00s ±32% ~ (p=1.000 n=5+5)
BM_MatMulScaling<float>/32 3.23s ± 5% 3.31s ± 5% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/36 3.69s ±32% 3.74s ±35% ~ (p=0.841 n=5+5)
BM_MatMulScaling<float>/72 9.40s ±60% 7.95s ±14% ~ (p=0.690 n=5+5)
name old time/op new time/op delta
BM_MatMulScaling<float>/1 1.67s ± 2% 1.69s ± 1% ~ (p=0.222 n=5+5)
BM_MatMulScaling<float>/2 850ms ± 7% 852ms ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/4 433ms ± 4% 434ms ± 2% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/6 302ms ±11% 293ms ± 2% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/8 221ms ± 3% 220ms ± 4% ~ (p=0.690 n=5+5)
BM_MatMulScaling<float>/12 153ms ± 4% 152ms ± 1% ~ (p=0.905 n=5+4)
BM_MatMulScaling<float>/16 130ms ±42% 126ms ±32% ~ (p=1.000 n=5+5)
BM_MatMulScaling<float>/32 102ms ± 5% 104ms ± 5% ~ (p=0.310 n=5+5)
BM_MatMulScaling<float>/36 104ms ±31% 105ms ±35% ~ (p=0.841 n=5+5)
BM_MatMulScaling<float>/72 136ms ±59% 123ms ±17% ~ (p=1.000 n=5+5)
name old INSTRUCTIONS/op new INSTRUCTIONS/op delta
BM_MatMulScaling<float>/1 15.1G ± 0% 15.1G ± 0% ~ (p=0.690 n=5+5)
BM_MatMulScaling<float>/2 15.5G ± 7% 15.1G ± 0% ~ (p=0.151 n=5+5)
BM_MatMulScaling<float>/4 15.4G ± 2% 15.1G ± 0% -1.56% (p=0.008 n=5+5)
BM_MatMulScaling<float>/6 16.1G ±13% 15.1G ± 0% -6.01% (p=0.008 n=5+5)
BM_MatMulScaling<float>/8 15.4G ± 2% 15.1G ± 0% ~ (p=0.056 n=5+5)
BM_MatMulScaling<float>/12 16.0G ± 3% 15.1G ± 0% -5.30% (p=0.008 n=5+5)
BM_MatMulScaling<float>/16 15.9G ± 4% 15.1G ± 0% -4.51% (p=0.029 n=4+4)
BM_MatMulScaling<float>/32 18.9G ±16% 15.1G ± 0% -20.06% (p=0.016 n=5+4)
BM_MatMulScaling<float>/36 21.9G ±34% 15.2G ± 0% -30.89% (p=0.008 n=5+5)
BM_MatMulScaling<float>/72 56.1G ±73% 15.2G ± 0% -72.93% (p=0.008 n=5+5)
name old CYCLES/op new CYCLES/op delta
BM_MatMulScaling<float>/1 5.70G ± 1% 5.70G ± 0% ~ (p=0.421 n=5+5)
BM_MatMulScaling<float>/2 5.83G ± 8% 5.69G ± 1% ~ (p=0.548 n=5+5)
BM_MatMulScaling<float>/4 5.72G ± 4% 5.66G ± 2% ~ (p=0.841 n=5+5)
BM_MatMulScaling<float>/6 5.97G ±12% 5.66G ± 1% -5.17% (p=0.008 n=5+5)
BM_MatMulScaling<float>/8 5.79G ± 3% 5.69G ± 1% ~ (p=0.056 n=5+5)
BM_MatMulScaling<float>/12 6.01G ± 5% 5.76G ± 1% -4.16% (p=0.008 n=5+5)
BM_MatMulScaling<float>/16 6.89G ±46% 5.87G ± 6% ~ (p=0.151 n=5+5)
BM_MatMulScaling<float>/32 10.1G ± 4% 9.3G ± 1% -8.11% (p=0.016 n=5+4)
BM_MatMulScaling<float>/36 10.4G ± 3% 9.1G ±11% -12.31% (p=0.016 n=4+5)
BM_MatMulScaling<float>/72 30.7G ±67% 11.0G ± 1% -64.07% (p=0.008 n=5+5)
Flame graph before (for 74 threads, so strongly oversubscribed):
Flame graph after (for 74 threads, so strongly oversubscribed):
Output of perf stat -d -d -d -v -- $command with 72 threads. Notice that while elapsed time remains roughly the same (within measurement noise), cycles, instructions, dcache loads, and dTLB loads decrease dramatically when yield is added:
With yield:
Performance counter stats for './blaze-bin/experimental/users/rmlarsen/bench/matmul_bench --benchmark_filter=BM_MatMulScaling<float>/72':
7,298.94 msec task-clock:u # 35.380 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,768 page-faults:u # 242.227 /sec
8,977,569,476 cycles:u # 1.230 GHz (38.99%)
12,988,913,246 instructions:u # 1.45 insn per cycle (46.70%)
111,602,532 branches:u # 15.290 M/sec (44.59%)
2,242,969 branch-misses:u # 2.01% of all branches (42.43%)
5,812,400,949 L1-dcache-loads:u # 796.335 M/sec (40.29%)
1,294,947,000 L1-dcache-load-misses:u # 22.28% of all L1-dcache accesses (37.84%)
14,235,824 LLC-loads:u # 1.950 M/sec (28.99%)
7,036,400 LLC-load-misses:u # 49.43% of all LL-cache accesses (30.08%)
<not supported> L1-icache-loads:u
1,328,642 L1-icache-load-misses:u (31.87%)
5,486,017,833 dTLB-loads:u # 751.618 M/sec (33.94%)
17,543 dTLB-load-misses:u # 0.00% of all dTLB cache accesses (35.49%)
45,339 iTLB-loads:u # 6.212 K/sec (35.98%)
38,021 iTLB-load-misses:u # 83.86% of all iTLB cache accesses (38.04%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
0.206303517 seconds time elapsed
4.425312000 seconds user
2.870901000 seconds sys
Without yield:
Performance counter stats for './blaze-bin/experimental/users/rmlarsen/bench/matmul_bench --benchmark_filter=BM_MatMulScaling<float>/72':
6,576.06 msec task-clock:u # 34.423 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
1,721 page-faults:u # 261.707 /sec
20,070,865,986 cycles:u # 3.052 GHz (39.22%)
35,503,400,176 instructions:u # 1.77 insn per cycle (46.77%)
5,537,475,000 branches:u # 842.066 M/sec (44.33%)
1,041,366 branch-misses:u # 0.02% of all branches (41.10%)
16,859,609,447 L1-dcache-loads:u # 2.564 G/sec (37.15%)
1,230,169,930 L1-dcache-load-misses:u # 7.30% of all L1-dcache accesses (32.78%)
14,661,464 LLC-loads:u # 2.230 M/sec (22.13%)
5,860,608 LLC-load-misses:u # 39.97% of all LL-cache accesses (27.21%)
<not supported> L1-icache-loads:u
611,297 L1-icache-load-misses:u (35.65%)
11,596,468,243 dTLB-loads:u # 1.763 G/sec (43.08%)
6,635 dTLB-load-misses:u # 0.00% of all dTLB cache accesses (44.94%)
2,030 iTLB-loads:u # 308.696 /sec (43.34%)
709 iTLB-load-misses:u # 34.93% of all iTLB cache accesses (41.15%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
0.191036989 seconds time elapsed
6.476650000 seconds user
0.095596000 seconds sys