Speed up tensor reduction

Speed up tensor reduction by strip-mining and unrolling the loops in InnerMostDimReducer and InnerMostDimPreserver. This change also removes a few redundant pieces of code where deferring to an existing specialization was possible.
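For readers unfamiliar with the technique, here is a minimal sketch of a strip-mined, unrolled sum over a contiguous innermost dimension. It is not the code in this change: the function name `strip_mined_sum` and the unroll factor of 4 are assumptions made for illustration, and it works on scalars rather than SIMD packets.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only (not Eigen's implementation).
// Sums n contiguous floats with a loop that is strip-mined into chunks of
// kUnroll elements and unrolled by hand.
float strip_mined_sum(const float* data, std::size_t n) {
  constexpr std::size_t kUnroll = 4;  // assumed unroll factor

  // Independent accumulators break the serial dependency chain of a single
  // running sum, so several additions can be in flight at once.
  float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

  // Largest multiple of kUnroll that is <= n.
  const std::size_t unrolled_end = n - (n % kUnroll);

  std::size_t i = 0;
  for (; i < unrolled_end; i += kUnroll) {
    acc0 += data[i + 0];
    acc1 += data[i + 1];
    acc2 += data[i + 2];
    acc3 += data[i + 3];
  }

  // Combine the partial sums, then handle the leftover tail elements.
  float sum = (acc0 + acc1) + (acc2 + acc3);
  for (; i < n; ++i) {
    sum += data[i];
  }
  return sum;
}
```

The extra setup and tail handling cost a little for very short inputs, and the summation order differs from a plain left-to-right loop, so floating-point results can differ in the last bits.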

Below are measurements of full, row, and column sum reductions of square 2D float tensors, with sizes ranging from 3 x 3 to 10k x 10k. They were measured single-threaded on a Skylake core, with code compiled by clang at approximately head.

AVX2:

name                                       old cpu/op  new cpu/op  delta
BM_fullReduction_1T/3   [using 1 threads]  13.4ns ± 8%  15.1ns ± 4%  +12.18%  (p=0.000 n=49+60)
BM_fullReduction_1T/4   [using 1 threads]  13.0ns ± 4%  15.4ns ±18%  +18.06%  (p=0.000 n=48+60)
BM_fullReduction_1T/7   [using 1 threads]  15.0ns ±13%  16.4ns ± 3%   +9.29%  (p=0.000 n=48+47)
BM_fullReduction_1T/8   [using 1 threads]  15.8ns ±19%  16.7ns ±13%   +5.84%  (p=0.000 n=60+60)
BM_fullReduction_1T/10  [using 1 threads]  18.7ns ±11%  18.5ns ± 9%     ~     (p=0.292 n=48+60)
BM_fullReduction_1T/15  [using 1 threads]  31.1ns ±12%  22.4ns ±17%  -27.80%  (p=0.000 n=52+57)
BM_fullReduction_1T/16  [using 1 threads]  34.3ns ±10%  23.0ns ±13%  -32.75%  (p=0.000 n=50+58)
BM_fullReduction_1T/31  [using 1 threads]   125ns ± 5%    48ns ± 9%  -61.81%  (p=0.000 n=60+60)
BM_fullReduction_1T/32  [using 1 threads]   134ns ± 6%    50ns ± 8%  -62.70%  (p=0.000 n=60+57)
BM_fullReduction_1T/64  [using 1 threads]   535ns ± 4%   160ns ± 5%  -70.06%  (p=0.000 n=60+60)
BM_fullReduction_1T/128 [using 1 threads]  2.15µs ± 4%  0.69µs ± 8%  -67.65%  (p=0.000 n=60+60)
BM_fullReduction_1T/256 [using 1 threads]  8.55µs ± 4%  2.77µs ± 5%  -67.65%  (p=0.000 n=60+55)
BM_fullReduction_1T/512 [using 1 threads]  34.5µs ± 3%  11.6µs ± 6%  -66.52%  (p=0.000 n=50+60)
BM_fullReduction_1T/1k  [using 1 threads]   155µs ± 4%   158µs ± 4%   +1.73%  (p=0.000 n=60+60)
BM_fullReduction_1T/2k  [using 1 threads]   682µs ±20%   684µs ±17%     ~     (p=0.475 n=40+45)
BM_fullReduction_1T/4k  [using 1 threads]  6.34ms ±12%  5.71ms ±11%   -9.98%  (p=0.000 n=39+35)
BM_fullReduction_1T/10k [using 1 threads]  37.4ms ± 7%  37.4ms ±32%     ~     (p=0.481 n=10+10)
 
 
name                                      old cpu/op  new cpu/op  delta
BM_rowReduction_1T/3   [using 1 threads]  29.0ns ± 7%  30.5ns ± 4%   +5.10%  (p=0.000 n=54+50)
BM_rowReduction_1T/4   [using 1 threads]  33.5ns ± 3%  38.5ns ± 4%  +15.07%  (p=0.000 n=50+50)
BM_rowReduction_1T/7   [using 1 threads]  54.6ns ± 4%  60.8ns ± 8%  +11.40%  (p=0.000 n=59+60)
BM_rowReduction_1T/8   [using 1 threads]  55.1ns ± 8%  52.1ns ± 9%   -5.40%  (p=0.000 n=60+60)
BM_rowReduction_1T/10  [using 1 threads]  75.8ns ± 7%  72.2ns ± 7%   -4.66%  (p=0.000 n=60+60)
BM_rowReduction_1T/15  [using 1 threads]   114ns ± 5%   123ns ± 6%   +7.98%  (p=0.000 n=60+60)
BM_rowReduction_1T/16  [using 1 threads]   102ns ± 5%    95ns ± 7%   -6.74%  (p=0.000 n=60+60)
BM_rowReduction_1T/31  [using 1 threads]   250ns ± 5%   264ns ± 4%   +5.56%  (p=0.000 n=55+55)
BM_rowReduction_1T/32  [using 1 threads]   232ns ± 4%   203ns ± 9%  -12.47%  (p=0.000 n=55+60)
BM_rowReduction_1T/64  [using 1 threads]   651ns ± 4%   482ns ± 6%  -25.95%  (p=0.000 n=60+60)
BM_rowReduction_1T/128 [using 1 threads]  1.90µs ± 3%  1.30µs ± 7%  -31.67%  (p=0.000 n=60+60)
BM_rowReduction_1T/256 [using 1 threads]  7.03µs ± 5%  3.69µs ± 5%  -47.44%  (p=0.000 n=60+49)
BM_rowReduction_1T/512 [using 1 threads]  28.6µs ± 4%  13.3µs ± 6%  -53.36%  (p=0.000 n=54+60)
BM_rowReduction_1T/1k  [using 1 threads]   158µs ± 9%   157µs ± 4%     ~     (p=0.948 n=60+60)
BM_rowReduction_1T/2k  [using 1 threads]   733µs ±37%   657µs ±13%  -10.36%  (p=0.000 n=45+40)
BM_rowReduction_1T/4k  [using 1 threads]  6.65ms ±11%  6.19ms ± 9%   -6.89%  (p=0.032 n=30+38)
BM_rowReduction_1T/10k [using 1 threads]  41.4ms ±11%  37.8ms ± 1%     ~     (p=0.080 n=12+10)
 
 
name                                      old cpu/op  new cpu/op  delta
BM_colReduction_1T/3   [using 1 threads]  21.8ns ± 5%  22.4ns ± 4%   +2.34%  (p=0.000 n=58+55)
BM_colReduction_1T/4   [using 1 threads]  20.8ns ± 6%  27.7ns ± 6%  +33.27%  (p=0.000 n=60+55)
BM_colReduction_1T/7   [using 1 threads]  32.0ns ± 4%  43.9ns ± 6%  +37.53%  (p=0.000 n=48+60)
BM_colReduction_1T/8   [using 1 threads]  28.7ns ±11%  24.8ns ± 3%  -13.81%  (p=0.000 n=53+55)
BM_colReduction_1T/10  [using 1 threads]  39.9ns ± 7%  37.8ns ± 4%   -5.12%  (p=0.000 n=53+50)
BM_colReduction_1T/15  [using 1 threads]  65.0ns ±10%  77.2ns ± 6%  +18.79%  (p=0.000 n=58+57)
BM_colReduction_1T/16  [using 1 threads]  56.5ns ± 7%  43.0ns ±21%  -23.92%  (p=0.000 n=48+60)
BM_colReduction_1T/31  [using 1 threads]   203ns ± 5%   210ns ± 6%   +3.46%  (p=0.000 n=60+59)
BM_colReduction_1T/32  [using 1 threads]   170ns ± 8%    95ns ± 7%  -44.18%  (p=0.000 n=60+60)
BM_colReduction_1T/64  [using 1 threads]   677ns ± 7%   261ns ± 4%  -61.43%  (p=0.000 n=60+55)
BM_colReduction_1T/128 [using 1 threads]  3.14µs ± 4%  1.40µs ± 5%  -55.45%  (p=0.000 n=50+60)
BM_colReduction_1T/256 [using 1 threads]  14.8µs ± 4%   5.4µs ± 6%  -63.24%  (p=0.000 n=60+60)
BM_colReduction_1T/512 [using 1 threads]  65.2µs ± 5%  25.2µs ± 5%  -61.31%  (p=0.000 n=60+55)
BM_colReduction_1T/1k  [using 1 threads]   754µs ± 6%   393µs ± 5%  -47.92%  (p=0.000 n=60+45)
BM_colReduction_1T/2k  [using 1 threads]  3.24ms ±18%  1.66ms ±17%  -48.61%  (p=0.000 n=35+42)
BM_colReduction_1T/4k  [using 1 threads]  70.3ms ± 3%  34.5ms ± 3%  -50.93%  (p=0.000 n=44+25)
BM_colReduction_1T/10k [using 1 threads]  69.5ms ± 0%  69.6ms ± 2%     ~     (p=0.605 n=10+15)

SSE4.2:

name                                       old cpu/op  new cpu/op  delta
BM_fullReduction_1T/3   [using 1 threads]  13.5ns ± 6%  13.1ns ± 4%   -2.72%  (p=0.000 n=59+60)
BM_fullReduction_1T/4   [using 1 threads]  13.2ns ± 8%  12.8ns ± 4%   -2.60%  (p=0.000 n=60+60)
BM_fullReduction_1T/7   [using 1 threads]  14.7ns ± 4%  14.5ns ± 5%   -1.16%  (p=0.014 n=48+60)
BM_fullReduction_1T/8   [using 1 threads]  14.8ns ± 4%  14.6ns ± 4%   -1.59%  (p=0.001 n=48+60)
BM_fullReduction_1T/10  [using 1 threads]  17.8ns ± 4%  16.5ns ± 5%   -7.15%  (p=0.000 n=48+60)
BM_fullReduction_1T/15  [using 1 threads]  29.9ns ± 7%  24.7ns ± 3%  -17.59%  (p=0.000 n=54+55)
BM_fullReduction_1T/16  [using 1 threads]  33.1ns ± 7%  27.1ns ± 4%  -18.35%  (p=0.000 n=47+54)
BM_fullReduction_1T/31  [using 1 threads]   123ns ± 4%    70ns ± 7%  -43.38%  (p=0.000 n=60+57)
BM_fullReduction_1T/32  [using 1 threads]   131ns ± 4%    78ns ± 7%  -40.77%  (p=0.000 n=60+60)
BM_fullReduction_1T/64  [using 1 threads]   534ns ± 4%   281ns ± 4%  -47.40%  (p=0.000 n=60+55)
BM_fullReduction_1T/128 [using 1 threads]  2.13µs ± 4%  1.23µs ± 4%  -42.17%  (p=0.000 n=60+60)
BM_fullReduction_1T/256 [using 1 threads]  8.54µs ± 4%  4.95µs ± 5%  -42.10%  (p=0.000 n=60+60)
BM_fullReduction_1T/512 [using 1 threads]  34.5µs ± 4%  20.2µs ± 4%  -41.43%  (p=0.000 n=50+60)
BM_fullReduction_1T/1k  [using 1 threads]   158µs ± 6%   154µs ± 5%   -2.46%  (p=0.000 n=60+60)
BM_fullReduction_1T/2k  [using 1 threads]   687µs ±25%   668µs ±23%     ~     (p=0.093 n=47+46)
BM_fullReduction_1T/4k  [using 1 threads]  5.86ms ± 6%  5.82ms ±10%     ~     (p=0.736 n=28+35)
BM_fullReduction_1T/10k [using 1 threads]  36.0ms ± 3%  35.5ms ± 3%     ~      (p=0.095 n=10+9)
 
 
name                                      old cpu/op  new cpu/op  delta
BM_rowReduction_1T/3   [using 1 threads]  28.8ns ± 4%  27.8ns ± 4%   -3.64%  (p=0.000 n=53+54)
BM_rowReduction_1T/4   [using 1 threads]  33.6ns ± 4%  33.7ns ± 6%     ~     (p=0.465 n=50+50)
BM_rowReduction_1T/7   [using 1 threads]  54.4ns ± 4%  52.9ns ± 4%   -2.81%  (p=0.000 n=60+60)
BM_rowReduction_1T/8   [using 1 threads]  53.8ns ± 4%  51.6ns ± 4%   -4.05%  (p=0.000 n=60+60)
BM_rowReduction_1T/10  [using 1 threads]  74.4ns ± 4%  71.2ns ± 4%   -4.39%  (p=0.000 n=60+58)
BM_rowReduction_1T/15  [using 1 threads]   113ns ± 4%   109ns ± 4%   -3.49%  (p=0.000 n=60+60)
BM_rowReduction_1T/16  [using 1 threads]   101ns ± 6%    97ns ± 6%   -3.91%  (p=0.000 n=60+60)
BM_rowReduction_1T/31  [using 1 threads]   250ns ± 4%   271ns ± 4%   +8.24%  (p=0.000 n=55+55)
BM_rowReduction_1T/32  [using 1 threads]   232ns ± 3%   222ns ± 4%   -4.31%  (p=0.000 n=55+59)
BM_rowReduction_1T/64  [using 1 threads]   654ns ± 4%   501ns ± 5%  -23.43%  (p=0.000 n=60+60)
BM_rowReduction_1T/128 [using 1 threads]  1.90µs ± 4%  1.62µs ± 5%  -14.84%  (p=0.000 n=60+59)
BM_rowReduction_1T/256 [using 1 threads]  7.07µs ± 4%  5.51µs ± 4%  -21.99%  (p=0.000 n=60+59)
BM_rowReduction_1T/512 [using 1 threads]  28.7µs ± 6%  21.1µs ± 4%  -26.28%  (p=0.000 n=55+60)
BM_rowReduction_1T/1k  [using 1 threads]   156µs ±10%   153µs ± 4%   -2.07%  (p=0.007 n=60+60)
BM_rowReduction_1T/2k  [using 1 threads]   705µs ±26%   678µs ±33%   -3.86%  (p=0.035 n=41+39)
BM_rowReduction_1T/4k  [using 1 threads]  7.04ms ±10%  6.31ms ± 8%  -10.45%  (p=0.000 n=41+36)
BM_rowReduction_1T/10k [using 1 threads]  42.6ms ± 6%  38.8ms ± 4%   -8.82%   (p=0.000 n=12+9)
 
name                                      old cpu/op  new cpu/op  delta
BM_colReduction_1T/3   [using 1 threads]  22.0ns ± 7%  22.1ns ± 7%     ~     (p=0.614 n=54+46)
BM_colReduction_1T/4   [using 1 threads]  20.6ns ± 5%  20.6ns ± 5%     ~     (p=0.771 n=60+48)
BM_colReduction_1T/7   [using 1 threads]  31.6ns ± 4%  31.6ns ± 3%     ~     (p=0.935 n=50+40)
BM_colReduction_1T/8   [using 1 threads]  27.8ns ± 9%  27.5ns ± 4%     ~     (p=0.113 n=45+44)
BM_colReduction_1T/10  [using 1 threads]  39.0ns ± 4%  38.6ns ± 5%   -0.86%  (p=0.048 n=50+40)
BM_colReduction_1T/15  [using 1 threads]  63.9ns ± 4%  63.1ns ± 4%   -1.20%  (p=0.005 n=60+48)
BM_colReduction_1T/16  [using 1 threads]  56.5ns ± 8%  47.2ns ± 9%  -16.50%  (p=0.000 n=59+49)
BM_colReduction_1T/31  [using 1 threads]   200ns ± 5%   145ns ± 8%  -27.33%  (p=0.000 n=60+60)
BM_colReduction_1T/32  [using 1 threads]   170ns ± 5%   100ns ± 6%  -40.78%  (p=0.000 n=60+55)
BM_colReduction_1T/64  [using 1 threads]   673ns ± 4%   291ns ± 5%  -56.83%  (p=0.000 n=60+55)
BM_colReduction_1T/128 [using 1 threads]  3.14µs ± 4%  2.43µs ± 6%  -22.70%  (p=0.000 n=50+55)
BM_colReduction_1T/256 [using 1 threads]  14.7µs ± 4%   9.6µs ± 5%  -35.06%  (p=0.000 n=60+60)
BM_colReduction_1T/512 [using 1 threads]  65.4µs ± 4%  44.2µs ± 5%  -32.42%  (p=0.000 n=59+59)
BM_colReduction_1T/1k  [using 1 threads]   761µs ± 8%   756µs ± 8%     ~     (p=0.274 n=60+60)
BM_colReduction_1T/2k  [using 1 threads]  3.22ms ±13%  3.27ms ±23%     ~     (p=0.629 n=37+37)
BM_colReduction_1T/4k  [using 1 threads]  70.9ms ±10%  69.8ms ± 8%   -1.47%  (p=0.028 n=40+40)
BM_colReduction_1T/10k [using 1 threads]  69.7ms ± 3%  79.6ms ± 2%  +14.22%  (p=0.000 n=13+14)
Edited by Rasmus Munk Larsen
