Optimize ThreadPool spinning

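Optimize thread pool spinning: if a task is submitted while a worker thread is in its spin loop, we can skip the expensive notification of a parked thread and hand the new task over to the spinning thread directly. This change significantly reduces overhead and improves wall-clock latency (roughly 35% to 49% lower time/op in the benchmarks below), with one exception: when the "main" thread submits a single task and waits for it (BM_PingPong< N, 1>), time/op regresses by roughly 158% to 184%.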
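
The handover idea can be illustrated with a minimal sketch. This is not the code from this merge request: `SpinningPool`, `RunDemo`, the single mutex-protected queue and the `spin_iters` parameter are all hypothetical and chosen only to keep the example short. Workers spin for a bounded number of iterations before parking on a condition variable, and `Submit()` skips the condition-variable notification whenever it observes a spinning worker.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical toy pool (not the pool changed by this merge request).
class SpinningPool {
 public:
  explicit SpinningPool(int num_threads, int spin_iters = 10000)
      : spin_iters_(spin_iters) {
    for (int i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }

  // Drains the queue, then stops and joins all workers.
  ~SpinningPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : workers_) t.join();
  }

  void Submit(std::function<void()> task) {
    assert(task && "task must be callable");
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(task));
    }
    if (num_spinning_.load() > 0) {
      // Handover: a spinning worker will find the task on its own (or when
      // it re-checks the queue under the lock before parking).
      handovers_.fetch_add(1);
      return;
    }
    notifications_.fetch_add(1);
    cv_.notify_one();  // Slow path: wake a parked worker.
  }

  int handovers() const { return handovers_.load(); }
  int notifications() const { return notifications_.load(); }

 private:
  bool TryPop(std::function<void()>* task) {
    std::lock_guard<std::mutex> lock(mu_);
    if (queue_.empty()) return false;
    *task = std::move(queue_.front());
    queue_.pop_front();
    return true;
  }

  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      bool have_task = false;
      num_spinning_.fetch_add(1);  // Advertise that this thread is spinning.
      for (int i = 0; i < spin_iters_ && !have_task; ++i)
        have_task = TryPop(&task);
      num_spinning_.fetch_sub(1);
      if (!have_task) {
        // Park. The predicate is checked under the lock, so a task pushed
        // without a notification is still seen here.
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // Shutting down and fully drained.
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }

  const int spin_iters_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::atomic<int> num_spinning_{0};
  std::atomic<int> handovers_{0};
  std::atomic<int> notifications_{0};
  std::vector<std::thread> workers_;  // Declared last: started after the rest.
};

struct DemoResult { int executed; int handovers; int notifications; };

// Submits `num_tasks` trivial tasks and reports which path each Submit took.
DemoResult RunDemo(int num_threads, int num_tasks) {
  std::atomic<int> executed{0};
  DemoResult result{};
  {
    SpinningPool pool(num_threads);
    for (int i = 0; i < num_tasks; ++i)
      pool.Submit([&executed] { executed.fetch_add(1); });
    result.handovers = pool.handovers();
    result.notifications = pool.notifications();
  }  // Destructor drains the queue and joins the workers.
  result.executed = executed.load();
  return result;
}
```

Calling `RunDemo(2, 1000)` runs all 1000 tasks; how many submissions take the handover path versus the notification path depends on timing. The sketch simplifies in two ways that a production pool would likely need to address: it takes a mutex inside the spin loop, and it does not track how many tasks have already been handed to spinning workers, so a burst of submissions can leave parked workers asleep while one spinner drains the queue.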

Benchmark source: https://gist.github.com/ezhulenev/eae431a9c23459e326a17fe874ab01ee

Benchmark results:

AMD Rome (Zen2) CPU

name                           old cpu/op   new cpu/op   delta
BM_PingPong< 8, 1>             11.4µs ± 3%  13.4µs ± 5%  +17.96%   (p=0.000 n=96+96)
BM_PingPong< 8, 2>             22.6µs ± 3%  13.2µs ± 7%  -41.56%   (p=0.000 n=86+96)
BM_PingPong< 8, 4>             45.5µs ± 4%  25.3µs ± 6%  -44.40%   (p=0.000 n=94+93)
BM_PingPong< 16, 1>            11.3µs ± 5%  13.4µs ± 5%  +18.37%   (p=0.000 n=94+94)
BM_PingPong< 16, 2>            22.6µs ± 3%  12.6µs ± 6%  -44.25%   (p=0.000 n=92+98)
BM_PingPong< 16, 4>            45.4µs ± 4%  24.8µs ± 5%  -45.43%   (p=0.000 n=96+96)
BM_PingPong< 32, 1>            11.5µs ± 6%  13.3µs ± 5%  +15.62%   (p=0.000 n=98+93)
BM_PingPong< 32, 2>            22.9µs ± 4%  12.8µs ± 7%  -44.25%   (p=0.000 n=93+98)
BM_PingPong< 32, 4>            45.8µs ± 5%  25.5µs ± 7%  -44.40%   (p=0.000 n=94+99)
BM_PingPong< 64, 1>            11.8µs ± 5%  13.3µs ± 4%  +12.35%   (p=0.000 n=93+91)
BM_PingPong< 64, 2>            23.1µs ± 5%  13.1µs ± 6%  -43.38%   (p=0.000 n=95+99)
BM_PingPong< 64, 4>            46.3µs ± 6%  26.1µs ± 5%  -43.62%   (p=0.000 n=92+99)
BM_ThreadPool< 8, 10, 10>      1.12ms ± 4%  0.57ms ± 5%  -48.92%   (p=0.000 n=93+96)
BM_ThreadPool< 8, 10, 100>     11.2ms ± 4%   5.7ms ± 5%  -49.04%   (p=0.000 n=97+95)
BM_ThreadPool< 8, 10, 1000>     112ms ± 5%    57ms ± 5%  -48.98%   (p=0.000 n=96+98)
BM_ThreadPool< 16, 10, 10>     1.11ms ± 5%  0.57ms ± 6%  -48.71%   (p=0.000 n=97+99)
BM_ThreadPool< 16, 10, 100>    11.1ms ± 6%   5.7ms ± 7%  -48.60%  (p=0.000 n=96+100)
BM_ThreadPool< 16, 10, 1000>    111ms ± 6%    57ms ± 6%  -48.55%   (p=0.000 n=96+96)
BM_ThreadPool< 32, 10, 10>     1.11ms ± 8%  0.59ms ± 4%  -46.78%   (p=0.000 n=97+93)
BM_ThreadPool< 32, 10, 100>    11.2ms ± 5%   5.9ms ± 4%  -47.24%   (p=0.000 n=95+98)
BM_ThreadPool< 32, 10, 1000>    112ms ± 7%    59ms ± 4%  -47.34%   (p=0.000 n=93+98)
BM_ThreadPool< 64, 10, 10>     1.14ms ± 6%  0.62ms ± 3%  -46.00%   (p=0.000 n=91+92)
BM_ThreadPool< 64, 10, 100>    11.5ms ± 6%   6.2ms ± 3%  -46.43%   (p=0.000 n=89+95)
BM_ThreadPool< 64, 10, 1000>    114ms ± 8%    61ms ± 4%  -45.95%   (p=0.000 n=97+99)

name                           old time/op             new time/op             delta
BM_PingPong< 8, 1>             3.23µs ± 3%             8.85µs ± 6%  +173.74%        (p=0.000 n=95+97)
BM_PingPong< 8, 2>             6.38µs ± 3%             3.90µs ±12%   -38.90%        (p=0.000 n=90+98)
BM_PingPong< 8, 4>             12.8µs ± 4%              7.4µs ± 6%   -42.11%        (p=0.000 n=93+96)
BM_PingPong< 16, 1>            3.21µs ± 6%             8.83µs ± 5%  +175.27%        (p=0.000 n=95+94)
BM_PingPong< 16, 2>            6.37µs ± 4%             3.72µs ±12%   -41.61%       (p=0.000 n=94+100)
BM_PingPong< 16, 4>            12.7µs ± 6%              7.2µs ± 7%   -43.56%        (p=0.000 n=96+94)
BM_PingPong< 32, 1>            3.27µs ± 7%             8.80µs ± 5%  +169.02%        (p=0.000 n=98+94)
BM_PingPong< 32, 2>            6.41µs ± 6%             3.62µs ± 7%   -43.61%       (p=0.000 n=96+100)
BM_PingPong< 32, 4>            12.7µs ± 7%              7.2µs ± 8%   -43.75%        (p=0.000 n=93+98)
BM_PingPong< 64, 1>            3.31µs ± 6%             8.76µs ± 4%  +164.56%        (p=0.000 n=92+91)
BM_PingPong< 64, 2>            6.39µs ± 7%             3.65µs ± 7%   -42.91%        (p=0.000 n=94+98)
BM_PingPong< 64, 4>            12.7µs ± 8%              7.2µs ± 6%   -43.03%       (p=0.000 n=91+100)
BM_ThreadPool< 8, 10, 10>       313µs ± 6%              160µs ± 4%   -48.76%        (p=0.000 n=95+97)
BM_ThreadPool< 8, 10, 100>     3.13ms ± 4%             1.59ms ± 7%   -49.29%        (p=0.000 n=96+97)
BM_ThreadPool< 8, 10, 1000>    31.3ms ± 6%             15.9ms ± 6%   -49.30%        (p=0.000 n=97+99)
BM_ThreadPool< 16, 10, 10>      307µs ± 6%              159µs ± 8%   -48.35%        (p=0.000 n=97+99)
BM_ThreadPool< 16, 10, 100>    3.06ms ± 8%             1.57ms ± 8%   -48.57%        (p=0.000 n=96+99)
BM_ThreadPool< 16, 10, 1000>   30.6ms ± 9%             15.8ms ± 7%   -48.35%        (p=0.000 n=98+97)
BM_ThreadPool< 32, 10, 10>      306µs ±10%              164µs ± 5%   -46.45%        (p=0.000 n=96+94)
BM_ThreadPool< 32, 10, 100>    3.09ms ± 7%             1.62ms ± 5%   -47.42%        (p=0.000 n=93+95)
BM_ThreadPool< 32, 10, 1000>   30.8ms ± 9%             16.2ms ± 5%   -47.39%        (p=0.000 n=94+97)
BM_ThreadPool< 64, 10, 10>      312µs ± 9%              169µs ± 5%   -45.81%        (p=0.000 n=91+97)
BM_ThreadPool< 64, 10, 100>    3.15ms ± 7%             1.68ms ± 5%   -46.75%        (p=0.000 n=88+95)
BM_ThreadPool< 64, 10, 1000>   30.9ms ±11%             16.7ms ± 6%   -46.10%        (p=0.000 n=96+98)

name                           old INSTRUCTIONS/op     new INSTRUCTIONS/op     delta
BM_PingPong< 8, 1>              30.5k ± 4%              57.9k ± 6%   +89.68%        (p=0.000 n=93+95)
BM_PingPong< 8, 2>              60.4k ± 5%              37.6k ± 8%   -37.75%        (p=0.000 n=95+97)
BM_PingPong< 8, 4>               113k ± 4%                64k ± 6%   -43.19%        (p=0.000 n=94+93)
BM_PingPong< 16, 1>             32.8k ± 7%              59.9k ± 7%   +82.69%        (p=0.000 n=96+93)
BM_PingPong< 16, 2>             64.4k ± 5%              35.7k ± 7%   -44.48%        (p=0.000 n=97+99)
BM_PingPong< 16, 4>              122k ± 4%                65k ± 6%   -46.86%        (p=0.000 n=93+94)
BM_PingPong< 32, 1>             32.7k ± 8%              56.8k ± 6%   +73.66%        (p=0.000 n=96+91)
BM_PingPong< 32, 2>             64.0k ± 5%              34.6k ± 7%   -46.01%        (p=0.000 n=94+98)
BM_PingPong< 32, 4>              120k ± 6%                65k ± 5%   -46.31%        (p=0.000 n=96+94)
BM_PingPong< 64, 1>             36.8k ± 6%              60.3k ± 5%   +63.57%        (p=0.000 n=94+91)
BM_PingPong< 64, 2>             71.7k ± 6%              38.9k ± 8%   -45.78%        (p=0.000 n=94+98)
BM_PingPong< 64, 4>              138k ± 7%                75k ± 5%   -46.07%        (p=0.000 n=89+98)
BM_ThreadPool< 8, 10, 10>       2.93M ± 5%              1.43M ± 5%   -51.08%       (p=0.000 n=96+100)
BM_ThreadPool< 8, 10, 100>      28.7M ± 5%              13.9M ± 4%   -51.38%       (p=0.000 n=94+100)
BM_ThreadPool< 8, 10, 1000>      290M ± 6%               140M ± 5%   -51.81%       (p=0.000 n=96+100)
BM_ThreadPool< 16, 10, 10>      3.07M ± 7%              1.48M ± 5%   -51.94%        (p=0.000 n=97+98)
BM_ThreadPool< 16, 10, 100>     30.2M ± 9%              14.4M ± 6%   -52.23%        (p=0.000 n=94+99)
BM_ThreadPool< 16, 10, 1000>     303M ± 9%               145M ± 5%   -52.03%        (p=0.000 n=97+97)
BM_ThreadPool< 32, 10, 10>      3.04M ±11%              1.53M ± 5%   -49.89%       (p=0.000 n=96+100)
BM_ThreadPool< 32, 10, 100>     30.5M ± 9%              14.9M ± 5%   -50.98%       (p=0.000 n=93+100)
BM_ThreadPool< 32, 10, 1000>     305M ±10%               149M ± 4%   -51.18%       (p=0.000 n=95+100)
BM_ThreadPool< 64, 10, 10>      3.44M ±11%              1.77M ± 2%   -48.48%        (p=0.000 n=95+96)
BM_ThreadPool< 64, 10, 100>     34.8M ± 7%              17.6M ± 3%   -49.62%        (p=0.000 n=89+97)
BM_ThreadPool< 64, 10, 1000>     341M ±11%               173M ± 3%   -49.23%       (p=0.000 n=95+100)

Intel Cascade Lake CPU

name                          old cpu/op   new cpu/op   delta
BM_PingPong< 8, 1>            7.02µs ±22%  7.93µs ±12%  +12.99%    (p=0.000 n=97+89)
BM_PingPong< 8, 2>            13.8µs ±20%   8.8µs ±32%  -36.22%    (p=0.000 n=98+98)
BM_PingPong< 8, 4>            26.9µs ±17%  17.3µs ±25%  -35.50%    (p=0.000 n=95+98)
BM_PingPong< 16, 1>           7.09µs ±22%  7.90µs ±12%  +11.35%    (p=0.000 n=96+86)
BM_PingPong< 16, 2>           13.9µs ±25%   8.4µs ±27%  -39.65%    (p=0.000 n=97+98)
BM_PingPong< 16, 4>           27.8µs ±25%  16.8µs ±25%  -39.72%    (p=0.000 n=97+99)
BM_PingPong< 32, 1>           7.29µs ±31%  7.94µs ±10%   +9.01%    (p=0.000 n=98+93)
BM_PingPong< 32, 2>           14.3µs ±27%   8.5µs ±37%  -40.56%   (p=0.000 n=100+96)
BM_PingPong< 32, 4>           27.9µs ±24%  16.9µs ±27%  -39.64%  (p=0.000 n=100+100)
BM_PingPong< 64, 1>           7.77µs ±32%  7.92µs ± 8%   +1.91%    (p=0.012 n=97+89)
BM_PingPong< 64, 2>           15.2µs ±32%   9.3µs ±26%  -38.40%    (p=0.000 n=99+99)
BM_PingPong< 64, 4>           29.7µs ±27%  18.7µs ±34%  -36.97%   (p=0.000 n=100+98)
BM_ThreadPool< 8, 10, 10>      675µs ±22%   410µs ±22%  -39.32%  (p=0.000 n=100+100)
BM_ThreadPool< 8, 10, 100>    6.70ms ±19%  3.93ms ±20%  -41.28%   (p=0.000 n=98+100)
BM_ThreadPool< 8, 10, 1000>   67.3ms ±21%  39.4ms ±21%  -41.37%   (p=0.000 n=100+99)
BM_ThreadPool< 16, 10, 10>     688µs ±21%   403µs ±16%  -41.48%  (p=0.000 n=100+100)
BM_ThreadPool< 16, 10, 100>   6.78ms ±23%  3.97ms ±19%  -41.46%    (p=0.000 n=99+99)
BM_ThreadPool< 16, 10, 1000>  68.5ms ±19%  40.0ms ±30%  -41.57%  (p=0.000 n=100+100)
BM_ThreadPool< 32, 10, 10>     685µs ±20%   410µs ±18%  -40.15%  (p=0.000 n=100+100)
BM_ThreadPool< 32, 10, 100>   6.81ms ±20%  4.05ms ±28%  -40.46%   (p=0.000 n=100+99)
BM_ThreadPool< 32, 10, 1000>  68.5ms ±21%  40.6ms ±32%  -40.74%   (p=0.000 n=100+96)
BM_ThreadPool< 64, 10, 10>     727µs ±36%   447µs ±35%  -38.51%  (p=0.000 n=100+100)
BM_ThreadPool< 64, 10, 100>   7.28ms ±29%  4.43ms ±20%  -39.14%   (p=0.000 n=100+93)
BM_ThreadPool< 64, 10, 1000>  73.0ms ±40%  44.5ms ±40%  -39.09%   (p=0.000 n=100+95)

name                          old time/op             new time/op             delta
BM_PingPong< 8, 1>            1.84µs ± 7%             5.20µs ±12%  +183.59%        (p=0.000 n=87+87)
BM_PingPong< 8, 2>            3.59µs ± 6%             2.32µs ± 8%   -35.45%        (p=0.000 n=92+94)
BM_PingPong< 8, 4>            7.06µs ± 7%             4.50µs ± 7%   -36.25%        (p=0.000 n=97+91)
BM_PingPong< 16, 1>           1.86µs ± 8%             5.18µs ± 9%  +178.11%        (p=0.000 n=88+84)
BM_PingPong< 16, 2>           3.59µs ± 8%             2.11µs ± 5%   -41.20%        (p=0.000 n=87+82)
BM_PingPong< 16, 4>           7.08µs ± 6%             4.17µs ± 5%   -41.20%        (p=0.000 n=93+84)
BM_PingPong< 32, 1>           1.91µs ± 7%             5.23µs ±11%  +173.02%        (p=0.000 n=84+93)
BM_PingPong< 32, 2>           3.64µs ±10%             2.05µs ± 7%   -43.53%        (p=0.000 n=94+78)
BM_PingPong< 32, 4>           7.04µs ± 8%             4.06µs ± 5%   -42.24%        (p=0.000 n=99+83)
BM_PingPong< 64, 1>           2.01µs ±13%             5.19µs ± 9%  +158.41%        (p=0.000 n=88+89)
BM_PingPong< 64, 2>           3.70µs ±13%             2.27µs ±14%   -38.82%        (p=0.000 n=92+84)
BM_PingPong< 64, 4>           7.15µs ±12%             4.46µs ±15%   -37.62%        (p=0.000 n=97+85)
BM_ThreadPool< 8, 10, 10>      173µs ± 7%               94µs ± 8%   -45.57%        (p=0.000 n=97+88)
BM_ThreadPool< 8, 10, 100>    1.74ms ± 5%             0.92ms ± 7%   -47.11%        (p=0.000 n=98+88)
BM_ThreadPool< 8, 10, 1000>   17.3ms ± 7%              9.1ms ± 6%   -47.17%       (p=0.000 n=100+88)
BM_ThreadPool< 16, 10, 10>     173µs ± 8%               95µs ±10%   -45.03%       (p=0.000 n=100+91)
BM_ThreadPool< 16, 10, 100>   1.73ms ± 8%             0.93ms ± 8%   -45.86%       (p=0.000 n=100+93)
BM_ThreadPool< 16, 10, 1000>  17.3ms ± 6%              9.2ms ± 6%   -46.83%        (p=0.000 n=99+89)
BM_ThreadPool< 32, 10, 10>     169µs ± 6%               96µs ±12%   -43.28%        (p=0.000 n=99+94)
BM_ThreadPool< 32, 10, 100>   1.69ms ± 6%             0.93ms ±10%   -44.83%       (p=0.000 n=100+93)
BM_ThreadPool< 32, 10, 1000>  16.8ms ± 8%              9.3ms ±10%   -44.65%       (p=0.000 n=100+92)
BM_ThreadPool< 64, 10, 10>     167µs ±10%               98µs ±13%   -41.16%        (p=0.000 n=98+91)
BM_ThreadPool< 64, 10, 100>   1.66ms ± 9%             0.97ms ±10%   -41.92%        (p=0.000 n=95+90)
BM_ThreadPool< 64, 10, 1000>  16.5ms ±11%              9.6ms ± 9%   -42.06%        (p=0.000 n=94+90)

name                          old INSTRUCTIONS/op     new INSTRUCTIONS/op     delta
BM_PingPong< 8, 1>             13.8k ± 6%              33.8k ±12%  +144.95%        (p=0.000 n=89+91)
BM_PingPong< 8, 2>             27.0k ± 5%              18.6k ± 6%   -31.04%        (p=0.000 n=85+95)
BM_PingPong< 8, 4>             52.0k ± 4%              35.1k ± 7%   -32.46%        (p=0.000 n=87+96)
BM_PingPong< 16, 1>            14.8k ± 3%              33.5k ±16%  +125.67%        (p=0.000 n=86+94)
BM_PingPong< 16, 2>            28.3k ± 4%              16.2k ±10%   -42.94%        (p=0.000 n=84+97)
BM_PingPong< 16, 4>            54.9k ± 3%              31.2k ± 9%   -43.19%        (p=0.000 n=87+97)
BM_PingPong< 32, 1>            16.3k ± 5%              31.7k ±10%   +94.63%        (p=0.000 n=91+91)
BM_PingPong< 32, 2>            30.1k ± 3%              16.3k ±13%   -45.80%        (p=0.000 n=88+91)
BM_PingPong< 32, 4>            58.0k ± 3%              31.6k ± 8%   -45.44%        (p=0.000 n=92+87)
BM_PingPong< 64, 1>            19.5k ± 4%              33.1k ±10%   +70.24%        (p=0.000 n=92+95)
BM_PingPong< 64, 2>            35.5k ± 5%              21.4k ± 5%   -39.87%        (p=0.000 n=90+84)
BM_PingPong< 64, 4>            68.5k ± 4%              42.7k ± 8%   -37.57%        (p=0.000 n=87+85)
BM_ThreadPool< 8, 10, 10>      1.27M ± 5%              0.70M ± 8%   -44.47%        (p=0.000 n=96+93)
BM_ThreadPool< 8, 10, 100>     12.8M ± 4%               6.6M ± 9%   -48.26%        (p=0.000 n=91+92)
BM_ThreadPool< 8, 10, 1000>     128M ± 4%                68M ±10%   -47.22%        (p=0.000 n=92+99)
BM_ThreadPool< 16, 10, 10>     1.34M ± 4%              0.70M ± 6%   -47.64%        (p=0.000 n=89+96)
BM_ThreadPool< 16, 10, 100>    13.4M ± 4%               6.8M ± 7%   -49.70%        (p=0.000 n=84+91)
BM_ThreadPool< 16, 10, 1000>    134M ± 3%                68M ± 8%   -49.12%        (p=0.000 n=82+96)
BM_ThreadPool< 32, 10, 10>     1.39M ± 3%              0.75M ± 6%   -45.75%        (p=0.000 n=85+97)
BM_ThreadPool< 32, 10, 100>    13.9M ± 3%               7.3M ± 6%   -47.68%        (p=0.000 n=84+91)
BM_ThreadPool< 32, 10, 1000>    139M ± 3%                74M ± 7%   -46.94%        (p=0.000 n=83+93)
BM_ThreadPool< 64, 10, 10>     1.61M ± 3%              0.96M ± 4%   -40.60%        (p=0.000 n=85+90)
BM_ThreadPool< 64, 10, 100>    16.2M ± 2%               9.4M ± 4%   -41.73%        (p=0.000 n=84+88)
BM_ThreadPool< 64, 10, 1000>    161M ± 3%                94M ± 5%   -41.53%        (p=0.000 n=82+85)
