Speed up tensor reduction
Speed up tensor reduction by strip mining and unrolling loops in InnerMostDimReducer and InnerMostDimPreserved.
This change also removes a few redundant pieces of code by deferring to an existing specialization where that was possible.
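As a rough illustration of what strip mining and unrolling mean for a reduction loop, here is a minimal standalone C++ sketch. It is not the Eigen implementation: the function names sum_contiguous and sum_over_outer, the unroll factor of 4, and the flat "outer x inner" array layout (inner index contiguous in memory) are all assumptions made for this example. The first function reduces along the contiguous dimension using several independent accumulators; the second preserves the contiguous dimension and reduces across it in strips, keeping the partial sums in local variables.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Eigen code): sums n contiguous floats.
// The loop is unrolled by kUnroll, and each unrolled position feeds its own
// accumulator so the additions do not form one long dependency chain.
float sum_contiguous(const float* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  constexpr std::size_t kUnroll = 4;  // assumed unroll factor
  float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
  const std::size_t unrolled_end = n - n % kUnroll;
  std::size_t i = 0;
  for (; i < unrolled_end; i += kUnroll) {
    acc0 += data[i + 0];
    acc1 += data[i + 1];
    acc2 += data[i + 2];
    acc3 += data[i + 3];
  }
  float total = (acc0 + acc1) + (acc2 + acc3);
  // Remainder loop for the last n % kUnroll elements.
  for (; i < n; ++i) total += data[i];
  return total;
}

// Hypothetical helper (not Eigen code): for an outer x inner array whose
// inner index is contiguous, returns out[j] = sum over i of data[i*inner + j].
// The preserved (inner) dimension is strip-mined into strips of kStrip
// elements; each strip keeps its partial sums in a small local array while
// walking the whole outer dimension.
std::vector<float> sum_over_outer(const float* data, std::size_t outer,
                                  std::size_t inner) {
  constexpr std::size_t kStrip = 4;  // assumed strip width
  std::vector<float> out(inner, 0.f);
  const std::size_t strip_end = inner - inner % kStrip;
  std::size_t j = 0;
  for (; j < strip_end; j += kStrip) {
    float acc[kStrip] = {0.f, 0.f, 0.f, 0.f};
    for (std::size_t i = 0; i < outer; ++i) {
      const float* row = data + i * inner + j;
      for (std::size_t k = 0; k < kStrip; ++k) acc[k] += row[k];
    }
    for (std::size_t k = 0; k < kStrip; ++k) out[j + k] = acc[k];
  }
  // Leftover inner positions that do not fill a whole strip.
  for (; j < inner; ++j) {
    float acc = 0.f;
    for (std::size_t i = 0; i < outer; ++i) acc += data[i * inner + j];
    out[j] = acc;
  }
  return out;
}
```

Because float addition is not associative, a reduction with several accumulators can differ from a strictly sequential loop in the last bits of the result.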
Below are measurements of full-, row-, and column-sum reductions of square 2D float tensors with sizes ranging from 3 x 3 to 10k x 10k. All runs were single-threaded on a Skylake core, using binaries compiled with clang at approximately head.
AVX2:
name old cpu/op new cpu/op delta
BM_fullReduction_1T/3 [using 1 threads] 13.4ns ± 8% 15.1ns ± 4% +12.18% (p=0.000 n=49+60)
BM_fullReduction_1T/4 [using 1 threads] 13.0ns ± 4% 15.4ns ±18% +18.06% (p=0.000 n=48+60)
BM_fullReduction_1T/7 [using 1 threads] 15.0ns ±13% 16.4ns ± 3% +9.29% (p=0.000 n=48+47)
BM_fullReduction_1T/8 [using 1 threads] 15.8ns ±19% 16.7ns ±13% +5.84% (p=0.000 n=60+60)
BM_fullReduction_1T/10 [using 1 threads] 18.7ns ±11% 18.5ns ± 9% ~ (p=0.292 n=48+60)
BM_fullReduction_1T/15 [using 1 threads] 31.1ns ±12% 22.4ns ±17% -27.80% (p=0.000 n=52+57)
BM_fullReduction_1T/16 [using 1 threads] 34.3ns ±10% 23.0ns ±13% -32.75% (p=0.000 n=50+58)
BM_fullReduction_1T/31 [using 1 threads] 125ns ± 5% 48ns ± 9% -61.81% (p=0.000 n=60+60)
BM_fullReduction_1T/32 [using 1 threads] 134ns ± 6% 50ns ± 8% -62.70% (p=0.000 n=60+57)
BM_fullReduction_1T/64 [using 1 threads] 535ns ± 4% 160ns ± 5% -70.06% (p=0.000 n=60+60)
BM_fullReduction_1T/128 [using 1 threads] 2.15µs ± 4% 0.69µs ± 8% -67.65% (p=0.000 n=60+60)
BM_fullReduction_1T/256 [using 1 threads] 8.55µs ± 4% 2.77µs ± 5% -67.65% (p=0.000 n=60+55)
BM_fullReduction_1T/512 [using 1 threads] 34.5µs ± 3% 11.6µs ± 6% -66.52% (p=0.000 n=50+60)
BM_fullReduction_1T/1k [using 1 threads] 155µs ± 4% 158µs ± 4% +1.73% (p=0.000 n=60+60)
BM_fullReduction_1T/2k [using 1 threads] 682µs ±20% 684µs ±17% ~ (p=0.475 n=40+45)
BM_fullReduction_1T/4k [using 1 threads] 6.34ms ±12% 5.71ms ±11% -9.98% (p=0.000 n=39+35)
BM_fullReduction_1T/10k [using 1 threads] 37.4ms ± 7% 37.4ms ±32% ~ (p=0.481 n=10+10)
name old cpu/op new cpu/op delta
BM_rowReduction_1T/3 [using 1 threads] 29.0ns ± 7% 30.5ns ± 4% +5.10% (p=0.000 n=54+50)
BM_rowReduction_1T/4 [using 1 threads] 33.5ns ± 3% 38.5ns ± 4% +15.07% (p=0.000 n=50+50)
BM_rowReduction_1T/7 [using 1 threads] 54.6ns ± 4% 60.8ns ± 8% +11.40% (p=0.000 n=59+60)
BM_rowReduction_1T/8 [using 1 threads] 55.1ns ± 8% 52.1ns ± 9% -5.40% (p=0.000 n=60+60)
BM_rowReduction_1T/10 [using 1 threads] 75.8ns ± 7% 72.2ns ± 7% -4.66% (p=0.000 n=60+60)
BM_rowReduction_1T/15 [using 1 threads] 114ns ± 5% 123ns ± 6% +7.98% (p=0.000 n=60+60)
BM_rowReduction_1T/16 [using 1 threads] 102ns ± 5% 95ns ± 7% -6.74% (p=0.000 n=60+60)
BM_rowReduction_1T/31 [using 1 threads] 250ns ± 5% 264ns ± 4% +5.56% (p=0.000 n=55+55)
BM_rowReduction_1T/32 [using 1 threads] 232ns ± 4% 203ns ± 9% -12.47% (p=0.000 n=55+60)
BM_rowReduction_1T/64 [using 1 threads] 651ns ± 4% 482ns ± 6% -25.95% (p=0.000 n=60+60)
BM_rowReduction_1T/128 [using 1 threads] 1.90µs ± 3% 1.30µs ± 7% -31.67% (p=0.000 n=60+60)
BM_rowReduction_1T/256 [using 1 threads] 7.03µs ± 5% 3.69µs ± 5% -47.44% (p=0.000 n=60+49)
BM_rowReduction_1T/512 [using 1 threads] 28.6µs ± 4% 13.3µs ± 6% -53.36% (p=0.000 n=54+60)
BM_rowReduction_1T/1k [using 1 threads] 158µs ± 9% 157µs ± 4% ~ (p=0.948 n=60+60)
BM_rowReduction_1T/2k [using 1 threads] 733µs ±37% 657µs ±13% -10.36% (p=0.000 n=45+40)
BM_rowReduction_1T/4k [using 1 threads] 6.65ms ±11% 6.19ms ± 9% -6.89% (p=0.032 n=30+38)
BM_rowReduction_1T/10k [using 1 threads] 41.4ms ±11% 37.8ms ± 1% ~ (p=0.080 n=12+10)
name old cpu/op new cpu/op delta
BM_colReduction_1T/3 [using 1 threads] 21.8ns ± 5% 22.4ns ± 4% +2.34% (p=0.000 n=58+55)
BM_colReduction_1T/4 [using 1 threads] 20.8ns ± 6% 27.7ns ± 6% +33.27% (p=0.000 n=60+55)
BM_colReduction_1T/7 [using 1 threads] 32.0ns ± 4% 43.9ns ± 6% +37.53% (p=0.000 n=48+60)
BM_colReduction_1T/8 [using 1 threads] 28.7ns ±11% 24.8ns ± 3% -13.81% (p=0.000 n=53+55)
BM_colReduction_1T/10 [using 1 threads] 39.9ns ± 7% 37.8ns ± 4% -5.12% (p=0.000 n=53+50)
BM_colReduction_1T/15 [using 1 threads] 65.0ns ±10% 77.2ns ± 6% +18.79% (p=0.000 n=58+57)
BM_colReduction_1T/16 [using 1 threads] 56.5ns ± 7% 43.0ns ±21% -23.92% (p=0.000 n=48+60)
BM_colReduction_1T/31 [using 1 threads] 203ns ± 5% 210ns ± 6% +3.46% (p=0.000 n=60+59)
BM_colReduction_1T/32 [using 1 threads] 170ns ± 8% 95ns ± 7% -44.18% (p=0.000 n=60+60)
BM_colReduction_1T/64 [using 1 threads] 677ns ± 7% 261ns ± 4% -61.43% (p=0.000 n=60+55)
BM_colReduction_1T/128 [using 1 threads] 3.14µs ± 4% 1.40µs ± 5% -55.45% (p=0.000 n=50+60)
BM_colReduction_1T/256 [using 1 threads] 14.8µs ± 4% 5.4µs ± 6% -63.24% (p=0.000 n=60+60)
BM_colReduction_1T/512 [using 1 threads] 65.2µs ± 5% 25.2µs ± 5% -61.31% (p=0.000 n=60+55)
BM_colReduction_1T/1k [using 1 threads] 754µs ± 6% 393µs ± 5% -47.92% (p=0.000 n=60+45)
BM_colReduction_1T/2k [using 1 threads] 3.24ms ±18% 1.66ms ±17% -48.61% (p=0.000 n=35+42)
BM_colReduction_1T/4k [using 1 threads] 70.3ms ± 3% 34.5ms ± 3% -50.93% (p=0.000 n=44+25)
BM_colReduction_1T/10k [using 1 threads] 69.5ms ± 0% 69.6ms ± 2% ~ (p=0.605 n=10+15)
SSE4.2:
name old cpu/op new cpu/op delta
BM_fullReduction_1T/3 [using 1 threads] 13.5ns ± 6% 13.1ns ± 4% -2.72% (p=0.000 n=59+60)
BM_fullReduction_1T/4 [using 1 threads] 13.2ns ± 8% 12.8ns ± 4% -2.60% (p=0.000 n=60+60)
BM_fullReduction_1T/7 [using 1 threads] 14.7ns ± 4% 14.5ns ± 5% -1.16% (p=0.014 n=48+60)
BM_fullReduction_1T/8 [using 1 threads] 14.8ns ± 4% 14.6ns ± 4% -1.59% (p=0.001 n=48+60)
BM_fullReduction_1T/10 [using 1 threads] 17.8ns ± 4% 16.5ns ± 5% -7.15% (p=0.000 n=48+60)
BM_fullReduction_1T/15 [using 1 threads] 29.9ns ± 7% 24.7ns ± 3% -17.59% (p=0.000 n=54+55)
BM_fullReduction_1T/16 [using 1 threads] 33.1ns ± 7% 27.1ns ± 4% -18.35% (p=0.000 n=47+54)
BM_fullReduction_1T/31 [using 1 threads] 123ns ± 4% 70ns ± 7% -43.38% (p=0.000 n=60+57)
BM_fullReduction_1T/32 [using 1 threads] 131ns ± 4% 78ns ± 7% -40.77% (p=0.000 n=60+60)
BM_fullReduction_1T/64 [using 1 threads] 534ns ± 4% 281ns ± 4% -47.40% (p=0.000 n=60+55)
BM_fullReduction_1T/128 [using 1 threads] 2.13µs ± 4% 1.23µs ± 4% -42.17% (p=0.000 n=60+60)
BM_fullReduction_1T/256 [using 1 threads] 8.54µs ± 4% 4.95µs ± 5% -42.10% (p=0.000 n=60+60)
BM_fullReduction_1T/512 [using 1 threads] 34.5µs ± 4% 20.2µs ± 4% -41.43% (p=0.000 n=50+60)
BM_fullReduction_1T/1k [using 1 threads] 158µs ± 6% 154µs ± 5% -2.46% (p=0.000 n=60+60)
BM_fullReduction_1T/2k [using 1 threads] 687µs ±25% 668µs ±23% ~ (p=0.093 n=47+46)
BM_fullReduction_1T/4k [using 1 threads] 5.86ms ± 6% 5.82ms ±10% ~ (p=0.736 n=28+35)
BM_fullReduction_1T/10k [using 1 threads] 36.0ms ± 3% 35.5ms ± 3% ~ (p=0.095 n=10+9)
name old cpu/op new cpu/op delta
BM_rowReduction_1T/3 [using 1 threads] 28.8ns ± 4% 27.8ns ± 4% -3.64% (p=0.000 n=53+54)
BM_rowReduction_1T/4 [using 1 threads] 33.6ns ± 4% 33.7ns ± 6% ~ (p=0.465 n=50+50)
BM_rowReduction_1T/7 [using 1 threads] 54.4ns ± 4% 52.9ns ± 4% -2.81% (p=0.000 n=60+60)
BM_rowReduction_1T/8 [using 1 threads] 53.8ns ± 4% 51.6ns ± 4% -4.05% (p=0.000 n=60+60)
BM_rowReduction_1T/10 [using 1 threads] 74.4ns ± 4% 71.2ns ± 4% -4.39% (p=0.000 n=60+58)
BM_rowReduction_1T/15 [using 1 threads] 113ns ± 4% 109ns ± 4% -3.49% (p=0.000 n=60+60)
BM_rowReduction_1T/16 [using 1 threads] 101ns ± 6% 97ns ± 6% -3.91% (p=0.000 n=60+60)
BM_rowReduction_1T/31 [using 1 threads] 250ns ± 4% 271ns ± 4% +8.24% (p=0.000 n=55+55)
BM_rowReduction_1T/32 [using 1 threads] 232ns ± 3% 222ns ± 4% -4.31% (p=0.000 n=55+59)
BM_rowReduction_1T/64 [using 1 threads] 654ns ± 4% 501ns ± 5% -23.43% (p=0.000 n=60+60)
BM_rowReduction_1T/128 [using 1 threads] 1.90µs ± 4% 1.62µs ± 5% -14.84% (p=0.000 n=60+59)
BM_rowReduction_1T/256 [using 1 threads] 7.07µs ± 4% 5.51µs ± 4% -21.99% (p=0.000 n=60+59)
BM_rowReduction_1T/512 [using 1 threads] 28.7µs ± 6% 21.1µs ± 4% -26.28% (p=0.000 n=55+60)
BM_rowReduction_1T/1k [using 1 threads] 156µs ±10% 153µs ± 4% -2.07% (p=0.007 n=60+60)
BM_rowReduction_1T/2k [using 1 threads] 705µs ±26% 678µs ±33% -3.86% (p=0.035 n=41+39)
BM_rowReduction_1T/4k [using 1 threads] 7.04ms ±10% 6.31ms ± 8% -10.45% (p=0.000 n=41+36)
BM_rowReduction_1T/10k [using 1 threads] 42.6ms ± 6% 38.8ms ± 4% -8.82% (p=0.000 n=12+9)
name old cpu/op new cpu/op delta
BM_colReduction_1T/3 [using 1 threads] 22.0ns ± 7% 22.1ns ± 7% ~ (p=0.614 n=54+46)
BM_colReduction_1T/4 [using 1 threads] 20.6ns ± 5% 20.6ns ± 5% ~ (p=0.771 n=60+48)
BM_colReduction_1T/7 [using 1 threads] 31.6ns ± 4% 31.6ns ± 3% ~ (p=0.935 n=50+40)
BM_colReduction_1T/8 [using 1 threads] 27.8ns ± 9% 27.5ns ± 4% ~ (p=0.113 n=45+44)
BM_colReduction_1T/10 [using 1 threads] 39.0ns ± 4% 38.6ns ± 5% -0.86% (p=0.048 n=50+40)
BM_colReduction_1T/15 [using 1 threads] 63.9ns ± 4% 63.1ns ± 4% -1.20% (p=0.005 n=60+48)
BM_colReduction_1T/16 [using 1 threads] 56.5ns ± 8% 47.2ns ± 9% -16.50% (p=0.000 n=59+49)
BM_colReduction_1T/31 [using 1 threads] 200ns ± 5% 145ns ± 8% -27.33% (p=0.000 n=60+60)
BM_colReduction_1T/32 [using 1 threads] 170ns ± 5% 100ns ± 6% -40.78% (p=0.000 n=60+55)
BM_colReduction_1T/64 [using 1 threads] 673ns ± 4% 291ns ± 5% -56.83% (p=0.000 n=60+55)
BM_colReduction_1T/128 [using 1 threads] 3.14µs ± 4% 2.43µs ± 6% -22.70% (p=0.000 n=50+55)
BM_colReduction_1T/256 [using 1 threads] 14.7µs ± 4% 9.6µs ± 5% -35.06% (p=0.000 n=60+60)
BM_colReduction_1T/512 [using 1 threads] 65.4µs ± 4% 44.2µs ± 5% -32.42% (p=0.000 n=59+59)
BM_colReduction_1T/1k [using 1 threads] 761µs ± 8% 756µs ± 8% ~ (p=0.274 n=60+60)
BM_colReduction_1T/2k [using 1 threads] 3.22ms ±13% 3.27ms ±23% ~ (p=0.629 n=37+37)
BM_colReduction_1T/4k [using 1 threads] 70.9ms ±10% 69.8ms ± 8% -1.47% (p=0.028 n=40+40)
BM_colReduction_1T/10k [using 1 threads] 69.7ms ± 3% 79.6ms ± 2% +14.22% (p=0.000 n=13+14)
Edited by Rasmus Munk Larsen