Optimize ThreadPool spinning

Optimize thread pool spinning: when a task is submitted while a thread is in its spin loop, skip the expensive notification of a parked thread and hand the new task over to the spinning thread. This significantly reduces overhead and improves wall-clock latency (roughly 35-50% lower cpu/op, time/op, and INSTRUCTIONS/op in most benchmarks below), except in the one case where the "main" thread submits a single task and waits for it (the BM_PingPong<*, 1> rows, which regress).
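The handover idea can be illustrated with a minimal sketch. This is not the actual ThreadPool code from this change: `ToyPool`, `RunDemo`, `kSpinCount`, and the two counters are hypothetical names invented for the illustration, and the toy uses a single worker and a mutex-protected queue for simplicity. It only shows the decision made on submit: if a worker is currently spinning, skip the condition-variable notification, because the spinning worker will find the task itself.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-worker pool; a sketch of the idea, not the real code.
class ToyPool {
 public:
  ToyPool() : worker_([this] { WorkerLoop(); }) {}
  ~ToyPool() { Shutdown(); }

  void Schedule(std::function<void()> task) {
    assert(task != nullptr);
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(task));
    }
    if (spinning_.load() > 0) {
      // A worker is in its spin loop and will pick the task up on its own,
      // or see it under the lock before parking: skip the notification.
      handovers_.fetch_add(1);
      return;
    }
    // No spinning worker: pay for waking a (possibly) parked thread.
    notifications_.fetch_add(1);
    cv_.notify_one();
  }

  // Drains the remaining tasks and joins the worker. Safe to call twice.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
    if (worker_.joinable()) worker_.join();
  }

  int handovers() const { return handovers_.load(); }
  int notifications() const { return notifications_.load(); }

 private:
  static constexpr int kSpinCount = 1000;  // assumed value, not tuned

  std::function<void()> TryPop() {
    std::lock_guard<std::mutex> lock(mu_);
    if (queue_.empty()) return nullptr;
    std::function<void()> task = std::move(queue_.front());
    queue_.pop_front();
    return task;
  }

  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      // Phase 1: spin for a bounded number of iterations looking for work.
      spinning_.fetch_add(1);
      for (int i = 0; i < kSpinCount && !task; ++i) task = TryPop();
      spinning_.fetch_sub(1);
      // Phase 2: park on the condition variable until work or shutdown.
      if (!task) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // shutting down and fully drained
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::atomic<int> spinning_{0};
  std::atomic<int> handovers_{0};
  std::atomic<int> notifications_{0};
  std::thread worker_;  // declared last so all other members exist first
};

struct DemoStats {
  int executed;
  int handovers;
  int notifications;
};

// Submits num_tasks trivial tasks and reports how each submit was handled.
DemoStats RunDemo(int num_tasks) {
  std::atomic<int> executed{0};
  ToyPool pool;
  for (int i = 0; i < num_tasks; ++i) {
    pool.Schedule([&executed] { executed.fetch_add(1); });
  }
  pool.Shutdown();
  return {executed.load(), pool.handovers(), pool.notifications()};
}
```

Calling `RunDemo(1000)` runs all 1000 tasks; how they split between `handovers` and `notifications` depends on thread timing and varies between runs. Skipping the notification is safe in this sketch because the worker rechecks the queue under the mutex before it parks.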
Benchmark source: https://gist.github.com/ezhulenev/eae431a9c23459e326a17fe874ab01ee
Benchmark results:
AMD Rome (Zen2) CPU
name old cpu/op new cpu/op delta
BM_PingPong< 8, 1> 11.4µs ± 3% 13.4µs ± 5% +17.96% (p=0.000 n=96+96)
BM_PingPong< 8, 2> 22.6µs ± 3% 13.2µs ± 7% -41.56% (p=0.000 n=86+96)
BM_PingPong< 8, 4> 45.5µs ± 4% 25.3µs ± 6% -44.40% (p=0.000 n=94+93)
BM_PingPong< 16, 1> 11.3µs ± 5% 13.4µs ± 5% +18.37% (p=0.000 n=94+94)
BM_PingPong< 16, 2> 22.6µs ± 3% 12.6µs ± 6% -44.25% (p=0.000 n=92+98)
BM_PingPong< 16, 4> 45.4µs ± 4% 24.8µs ± 5% -45.43% (p=0.000 n=96+96)
BM_PingPong< 32, 1> 11.5µs ± 6% 13.3µs ± 5% +15.62% (p=0.000 n=98+93)
BM_PingPong< 32, 2> 22.9µs ± 4% 12.8µs ± 7% -44.25% (p=0.000 n=93+98)
BM_PingPong< 32, 4> 45.8µs ± 5% 25.5µs ± 7% -44.40% (p=0.000 n=94+99)
BM_PingPong< 64, 1> 11.8µs ± 5% 13.3µs ± 4% +12.35% (p=0.000 n=93+91)
BM_PingPong< 64, 2> 23.1µs ± 5% 13.1µs ± 6% -43.38% (p=0.000 n=95+99)
BM_PingPong< 64, 4> 46.3µs ± 6% 26.1µs ± 5% -43.62% (p=0.000 n=92+99)
BM_ThreadPool< 8, 10, 10> 1.12ms ± 4% 0.57ms ± 5% -48.92% (p=0.000 n=93+96)
BM_ThreadPool< 8, 10, 100> 11.2ms ± 4% 5.7ms ± 5% -49.04% (p=0.000 n=97+95)
BM_ThreadPool< 8, 10, 1000> 112ms ± 5% 57ms ± 5% -48.98% (p=0.000 n=96+98)
BM_ThreadPool< 16, 10, 10> 1.11ms ± 5% 0.57ms ± 6% -48.71% (p=0.000 n=97+99)
BM_ThreadPool< 16, 10, 100> 11.1ms ± 6% 5.7ms ± 7% -48.60% (p=0.000 n=96+100)
BM_ThreadPool< 16, 10, 1000> 111ms ± 6% 57ms ± 6% -48.55% (p=0.000 n=96+96)
BM_ThreadPool< 32, 10, 10> 1.11ms ± 8% 0.59ms ± 4% -46.78% (p=0.000 n=97+93)
BM_ThreadPool< 32, 10, 100> 11.2ms ± 5% 5.9ms ± 4% -47.24% (p=0.000 n=95+98)
BM_ThreadPool< 32, 10, 1000> 112ms ± 7% 59ms ± 4% -47.34% (p=0.000 n=93+98)
BM_ThreadPool< 64, 10, 10> 1.14ms ± 6% 0.62ms ± 3% -46.00% (p=0.000 n=91+92)
BM_ThreadPool< 64, 10, 100> 11.5ms ± 6% 6.2ms ± 3% -46.43% (p=0.000 n=89+95)
BM_ThreadPool< 64, 10, 1000> 114ms ± 8% 61ms ± 4% -45.95% (p=0.000 n=97+99)
name old time/op new time/op delta
BM_PingPong< 8, 1> 3.23µs ± 3% 8.85µs ± 6% +173.74% (p=0.000 n=95+97)
BM_PingPong< 8, 2> 6.38µs ± 3% 3.90µs ±12% -38.90% (p=0.000 n=90+98)
BM_PingPong< 8, 4> 12.8µs ± 4% 7.4µs ± 6% -42.11% (p=0.000 n=93+96)
BM_PingPong< 16, 1> 3.21µs ± 6% 8.83µs ± 5% +175.27% (p=0.000 n=95+94)
BM_PingPong< 16, 2> 6.37µs ± 4% 3.72µs ±12% -41.61% (p=0.000 n=94+100)
BM_PingPong< 16, 4> 12.7µs ± 6% 7.2µs ± 7% -43.56% (p=0.000 n=96+94)
BM_PingPong< 32, 1> 3.27µs ± 7% 8.80µs ± 5% +169.02% (p=0.000 n=98+94)
BM_PingPong< 32, 2> 6.41µs ± 6% 3.62µs ± 7% -43.61% (p=0.000 n=96+100)
BM_PingPong< 32, 4> 12.7µs ± 7% 7.2µs ± 8% -43.75% (p=0.000 n=93+98)
BM_PingPong< 64, 1> 3.31µs ± 6% 8.76µs ± 4% +164.56% (p=0.000 n=92+91)
BM_PingPong< 64, 2> 6.39µs ± 7% 3.65µs ± 7% -42.91% (p=0.000 n=94+98)
BM_PingPong< 64, 4> 12.7µs ± 8% 7.2µs ± 6% -43.03% (p=0.000 n=91+100)
BM_ThreadPool< 8, 10, 10> 313µs ± 6% 160µs ± 4% -48.76% (p=0.000 n=95+97)
BM_ThreadPool< 8, 10, 100> 3.13ms ± 4% 1.59ms ± 7% -49.29% (p=0.000 n=96+97)
BM_ThreadPool< 8, 10, 1000> 31.3ms ± 6% 15.9ms ± 6% -49.30% (p=0.000 n=97+99)
BM_ThreadPool< 16, 10, 10> 307µs ± 6% 159µs ± 8% -48.35% (p=0.000 n=97+99)
BM_ThreadPool< 16, 10, 100> 3.06ms ± 8% 1.57ms ± 8% -48.57% (p=0.000 n=96+99)
BM_ThreadPool< 16, 10, 1000> 30.6ms ± 9% 15.8ms ± 7% -48.35% (p=0.000 n=98+97)
BM_ThreadPool< 32, 10, 10> 306µs ±10% 164µs ± 5% -46.45% (p=0.000 n=96+94)
BM_ThreadPool< 32, 10, 100> 3.09ms ± 7% 1.62ms ± 5% -47.42% (p=0.000 n=93+95)
BM_ThreadPool< 32, 10, 1000> 30.8ms ± 9% 16.2ms ± 5% -47.39% (p=0.000 n=94+97)
BM_ThreadPool< 64, 10, 10> 312µs ± 9% 169µs ± 5% -45.81% (p=0.000 n=91+97)
BM_ThreadPool< 64, 10, 100> 3.15ms ± 7% 1.68ms ± 5% -46.75% (p=0.000 n=88+95)
BM_ThreadPool< 64, 10, 1000> 30.9ms ±11% 16.7ms ± 6% -46.10% (p=0.000 n=96+98)
name old INSTRUCTIONS/op new INSTRUCTIONS/op delta
BM_PingPong< 8, 1> 30.5k ± 4% 57.9k ± 6% +89.68% (p=0.000 n=93+95)
BM_PingPong< 8, 2> 60.4k ± 5% 37.6k ± 8% -37.75% (p=0.000 n=95+97)
BM_PingPong< 8, 4> 113k ± 4% 64k ± 6% -43.19% (p=0.000 n=94+93)
BM_PingPong< 16, 1> 32.8k ± 7% 59.9k ± 7% +82.69% (p=0.000 n=96+93)
BM_PingPong< 16, 2> 64.4k ± 5% 35.7k ± 7% -44.48% (p=0.000 n=97+99)
BM_PingPong< 16, 4> 122k ± 4% 65k ± 6% -46.86% (p=0.000 n=93+94)
BM_PingPong< 32, 1> 32.7k ± 8% 56.8k ± 6% +73.66% (p=0.000 n=96+91)
BM_PingPong< 32, 2> 64.0k ± 5% 34.6k ± 7% -46.01% (p=0.000 n=94+98)
BM_PingPong< 32, 4> 120k ± 6% 65k ± 5% -46.31% (p=0.000 n=96+94)
BM_PingPong< 64, 1> 36.8k ± 6% 60.3k ± 5% +63.57% (p=0.000 n=94+91)
BM_PingPong< 64, 2> 71.7k ± 6% 38.9k ± 8% -45.78% (p=0.000 n=94+98)
BM_PingPong< 64, 4> 138k ± 7% 75k ± 5% -46.07% (p=0.000 n=89+98)
BM_ThreadPool< 8, 10, 10> 2.93M ± 5% 1.43M ± 5% -51.08% (p=0.000 n=96+100)
BM_ThreadPool< 8, 10, 100> 28.7M ± 5% 13.9M ± 4% -51.38% (p=0.000 n=94+100)
BM_ThreadPool< 8, 10, 1000> 290M ± 6% 140M ± 5% -51.81% (p=0.000 n=96+100)
BM_ThreadPool< 16, 10, 10> 3.07M ± 7% 1.48M ± 5% -51.94% (p=0.000 n=97+98)
BM_ThreadPool< 16, 10, 100> 30.2M ± 9% 14.4M ± 6% -52.23% (p=0.000 n=94+99)
BM_ThreadPool< 16, 10, 1000> 303M ± 9% 145M ± 5% -52.03% (p=0.000 n=97+97)
BM_ThreadPool< 32, 10, 10> 3.04M ±11% 1.53M ± 5% -49.89% (p=0.000 n=96+100)
BM_ThreadPool< 32, 10, 100> 30.5M ± 9% 14.9M ± 5% -50.98% (p=0.000 n=93+100)
BM_ThreadPool< 32, 10, 1000> 305M ±10% 149M ± 4% -51.18% (p=0.000 n=95+100)
BM_ThreadPool< 64, 10, 10> 3.44M ±11% 1.77M ± 2% -48.48% (p=0.000 n=95+96)
BM_ThreadPool< 64, 10, 100> 34.8M ± 7% 17.6M ± 3% -49.62% (p=0.000 n=89+97)
BM_ThreadPool< 64, 10, 1000> 341M ±11% 173M ± 3% -49.23% (p=0.000 n=95+100)
Intel Cascade Lake CPU
name old cpu/op new cpu/op delta
BM_PingPong< 8, 1> 7.02µs ±22% 7.93µs ±12% +12.99% (p=0.000 n=97+89)
BM_PingPong< 8, 2> 13.8µs ±20% 8.8µs ±32% -36.22% (p=0.000 n=98+98)
BM_PingPong< 8, 4> 26.9µs ±17% 17.3µs ±25% -35.50% (p=0.000 n=95+98)
BM_PingPong< 16, 1> 7.09µs ±22% 7.90µs ±12% +11.35% (p=0.000 n=96+86)
BM_PingPong< 16, 2> 13.9µs ±25% 8.4µs ±27% -39.65% (p=0.000 n=97+98)
BM_PingPong< 16, 4> 27.8µs ±25% 16.8µs ±25% -39.72% (p=0.000 n=97+99)
BM_PingPong< 32, 1> 7.29µs ±31% 7.94µs ±10% +9.01% (p=0.000 n=98+93)
BM_PingPong< 32, 2> 14.3µs ±27% 8.5µs ±37% -40.56% (p=0.000 n=100+96)
BM_PingPong< 32, 4> 27.9µs ±24% 16.9µs ±27% -39.64% (p=0.000 n=100+100)
BM_PingPong< 64, 1> 7.77µs ±32% 7.92µs ± 8% +1.91% (p=0.012 n=97+89)
BM_PingPong< 64, 2> 15.2µs ±32% 9.3µs ±26% -38.40% (p=0.000 n=99+99)
BM_PingPong< 64, 4> 29.7µs ±27% 18.7µs ±34% -36.97% (p=0.000 n=100+98)
BM_ThreadPool< 8, 10, 10> 675µs ±22% 410µs ±22% -39.32% (p=0.000 n=100+100)
BM_ThreadPool< 8, 10, 100> 6.70ms ±19% 3.93ms ±20% -41.28% (p=0.000 n=98+100)
BM_ThreadPool< 8, 10, 1000> 67.3ms ±21% 39.4ms ±21% -41.37% (p=0.000 n=100+99)
BM_ThreadPool< 16, 10, 10> 688µs ±21% 403µs ±16% -41.48% (p=0.000 n=100+100)
BM_ThreadPool< 16, 10, 100> 6.78ms ±23% 3.97ms ±19% -41.46% (p=0.000 n=99+99)
BM_ThreadPool< 16, 10, 1000> 68.5ms ±19% 40.0ms ±30% -41.57% (p=0.000 n=100+100)
BM_ThreadPool< 32, 10, 10> 685µs ±20% 410µs ±18% -40.15% (p=0.000 n=100+100)
BM_ThreadPool< 32, 10, 100> 6.81ms ±20% 4.05ms ±28% -40.46% (p=0.000 n=100+99)
BM_ThreadPool< 32, 10, 1000> 68.5ms ±21% 40.6ms ±32% -40.74% (p=0.000 n=100+96)
BM_ThreadPool< 64, 10, 10> 727µs ±36% 447µs ±35% -38.51% (p=0.000 n=100+100)
BM_ThreadPool< 64, 10, 100> 7.28ms ±29% 4.43ms ±20% -39.14% (p=0.000 n=100+93)
BM_ThreadPool< 64, 10, 1000> 73.0ms ±40% 44.5ms ±40% -39.09% (p=0.000 n=100+95)
name old time/op new time/op delta
BM_PingPong< 8, 1> 1.84µs ± 7% 5.20µs ±12% +183.59% (p=0.000 n=87+87)
BM_PingPong< 8, 2> 3.59µs ± 6% 2.32µs ± 8% -35.45% (p=0.000 n=92+94)
BM_PingPong< 8, 4> 7.06µs ± 7% 4.50µs ± 7% -36.25% (p=0.000 n=97+91)
BM_PingPong< 16, 1> 1.86µs ± 8% 5.18µs ± 9% +178.11% (p=0.000 n=88+84)
BM_PingPong< 16, 2> 3.59µs ± 8% 2.11µs ± 5% -41.20% (p=0.000 n=87+82)
BM_PingPong< 16, 4> 7.08µs ± 6% 4.17µs ± 5% -41.20% (p=0.000 n=93+84)
BM_PingPong< 32, 1> 1.91µs ± 7% 5.23µs ±11% +173.02% (p=0.000 n=84+93)
BM_PingPong< 32, 2> 3.64µs ±10% 2.05µs ± 7% -43.53% (p=0.000 n=94+78)
BM_PingPong< 32, 4> 7.04µs ± 8% 4.06µs ± 5% -42.24% (p=0.000 n=99+83)
BM_PingPong< 64, 1> 2.01µs ±13% 5.19µs ± 9% +158.41% (p=0.000 n=88+89)
BM_PingPong< 64, 2> 3.70µs ±13% 2.27µs ±14% -38.82% (p=0.000 n=92+84)
BM_PingPong< 64, 4> 7.15µs ±12% 4.46µs ±15% -37.62% (p=0.000 n=97+85)
BM_ThreadPool< 8, 10, 10> 173µs ± 7% 94µs ± 8% -45.57% (p=0.000 n=97+88)
BM_ThreadPool< 8, 10, 100> 1.74ms ± 5% 0.92ms ± 7% -47.11% (p=0.000 n=98+88)
BM_ThreadPool< 8, 10, 1000> 17.3ms ± 7% 9.1ms ± 6% -47.17% (p=0.000 n=100+88)
BM_ThreadPool< 16, 10, 10> 173µs ± 8% 95µs ±10% -45.03% (p=0.000 n=100+91)
BM_ThreadPool< 16, 10, 100> 1.73ms ± 8% 0.93ms ± 8% -45.86% (p=0.000 n=100+93)
BM_ThreadPool< 16, 10, 1000> 17.3ms ± 6% 9.2ms ± 6% -46.83% (p=0.000 n=99+89)
BM_ThreadPool< 32, 10, 10> 169µs ± 6% 96µs ±12% -43.28% (p=0.000 n=99+94)
BM_ThreadPool< 32, 10, 100> 1.69ms ± 6% 0.93ms ±10% -44.83% (p=0.000 n=100+93)
BM_ThreadPool< 32, 10, 1000> 16.8ms ± 8% 9.3ms ±10% -44.65% (p=0.000 n=100+92)
BM_ThreadPool< 64, 10, 10> 167µs ±10% 98µs ±13% -41.16% (p=0.000 n=98+91)
BM_ThreadPool< 64, 10, 100> 1.66ms ± 9% 0.97ms ±10% -41.92% (p=0.000 n=95+90)
BM_ThreadPool< 64, 10, 1000> 16.5ms ±11% 9.6ms ± 9% -42.06% (p=0.000 n=94+90)
name old INSTRUCTIONS/op new INSTRUCTIONS/op delta
BM_PingPong< 8, 1> 13.8k ± 6% 33.8k ±12% +144.95% (p=0.000 n=89+91)
BM_PingPong< 8, 2> 27.0k ± 5% 18.6k ± 6% -31.04% (p=0.000 n=85+95)
BM_PingPong< 8, 4> 52.0k ± 4% 35.1k ± 7% -32.46% (p=0.000 n=87+96)
BM_PingPong< 16, 1> 14.8k ± 3% 33.5k ±16% +125.67% (p=0.000 n=86+94)
BM_PingPong< 16, 2> 28.3k ± 4% 16.2k ±10% -42.94% (p=0.000 n=84+97)
BM_PingPong< 16, 4> 54.9k ± 3% 31.2k ± 9% -43.19% (p=0.000 n=87+97)
BM_PingPong< 32, 1> 16.3k ± 5% 31.7k ±10% +94.63% (p=0.000 n=91+91)
BM_PingPong< 32, 2> 30.1k ± 3% 16.3k ±13% -45.80% (p=0.000 n=88+91)
BM_PingPong< 32, 4> 58.0k ± 3% 31.6k ± 8% -45.44% (p=0.000 n=92+87)
BM_PingPong< 64, 1> 19.5k ± 4% 33.1k ±10% +70.24% (p=0.000 n=92+95)
BM_PingPong< 64, 2> 35.5k ± 5% 21.4k ± 5% -39.87% (p=0.000 n=90+84)
BM_PingPong< 64, 4> 68.5k ± 4% 42.7k ± 8% -37.57% (p=0.000 n=87+85)
BM_ThreadPool< 8, 10, 10> 1.27M ± 5% 0.70M ± 8% -44.47% (p=0.000 n=96+93)
BM_ThreadPool< 8, 10, 100> 12.8M ± 4% 6.6M ± 9% -48.26% (p=0.000 n=91+92)
BM_ThreadPool< 8, 10, 1000> 128M ± 4% 68M ±10% -47.22% (p=0.000 n=92+99)
BM_ThreadPool< 16, 10, 10> 1.34M ± 4% 0.70M ± 6% -47.64% (p=0.000 n=89+96)
BM_ThreadPool< 16, 10, 100> 13.4M ± 4% 6.8M ± 7% -49.70% (p=0.000 n=84+91)
BM_ThreadPool< 16, 10, 1000> 134M ± 3% 68M ± 8% -49.12% (p=0.000 n=82+96)
BM_ThreadPool< 32, 10, 10> 1.39M ± 3% 0.75M ± 6% -45.75% (p=0.000 n=85+97)
BM_ThreadPool< 32, 10, 100> 13.9M ± 3% 7.3M ± 6% -47.68% (p=0.000 n=84+91)
BM_ThreadPool< 32, 10, 1000> 139M ± 3% 74M ± 7% -46.94% (p=0.000 n=83+93)
BM_ThreadPool< 64, 10, 10> 1.61M ± 3% 0.96M ± 4% -40.60% (p=0.000 n=85+90)
BM_ThreadPool< 64, 10, 100> 16.2M ± 2% 9.4M ± 4% -41.73% (p=0.000 n=84+88)
BM_ThreadPool< 64, 10, 1000> 161M ± 3% 94M ± 5% -41.53% (p=0.000 n=82+85)