Use ThreadPoolExecutor in ShardedStorage to parallelize bulk operations to different shards

Jeremiah Bonney requested to merge jbonney/sharded-storage-threadpool into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests - Covered by existing tests
  • Metrics - N/A
  • Documentation update(s) - Added option description to parser

Description

This PR aims to speed up ShardedStorage by using a ThreadPoolExecutor to run bulk operations on the individual shards in parallel. Previously, each shard was accessed sequentially, which was quite slow and ignored a major benefit of sharding. One ThreadPoolExecutor is created for the entire storage, and its size can be set in the configuration. I opted for this approach instead of spinning up a smaller ThreadPoolExecutor per request because the scaling is more predictable and tunable.
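To illustrate the approach, here is a minimal sketch rather than the code in this MR. The class names `DictShard` and `ShardedStore`, the `thread_pool_size` parameter, and the hash-based shard selection are all assumptions made for the example. It shows a single pool shared by every request, with one task submitted per shard.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class DictShard:
    """Hypothetical in-memory shard standing in for a real storage backend."""

    def __init__(self):
        self._blobs = {}

    def bulk_update(self, blobs):
        self._blobs.update(blobs)

    def bulk_read(self, keys):
        return {k: self._blobs[k] for k in keys if k in self._blobs}


class ShardedStore:
    """Illustrative sharded store; not the real ShardedStorage class."""

    def __init__(self, shards, thread_pool_size=4):
        self._shards = shards
        # One pool for the whole storage, shared across all requests,
        # so the total number of worker threads is bounded by one setting.
        self._pool = ThreadPoolExecutor(max_workers=thread_pool_size)

    def _shard_index(self, key):
        # Assumed routing scheme: stable hash of the key modulo shard count.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def _group_by_shard(self, keys):
        groups = defaultdict(list)
        for key in keys:
            groups[self._shard_index(key)].append(key)
        return groups

    def bulk_update(self, blobs):
        groups = self._group_by_shard(blobs)
        futures = [
            self._pool.submit(
                self._shards[index].bulk_update, {k: blobs[k] for k in keys}
            )
            for index, keys in groups.items()
        ]
        for future in futures:
            future.result()  # re-raises any exception from the worker

    def bulk_read(self, keys):
        groups = self._group_by_shard(keys)
        futures = [
            self._pool.submit(self._shards[index].bulk_read, shard_keys)
            for index, shard_keys in groups.items()
        ]
        merged = {}
        for future in futures:
            merged.update(future.result())
        return merged
```

With a per-request pool, the number of live threads would scale with the number of concurrent requests. With a shared pool, it is capped by the single configured size.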

We already have a lot of threads, but storage operations should be mostly I/O-bound, so my hope is that we will still benefit from the additional threads.

As part of implementing this, I added a small helper to context.py that copies all the BuildGrid ContextVars into the sharded storage worker threads, which keeps logging working as expected.
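The idea behind such a helper can be sketched with the standard `contextvars` module. This is not the helper added to context.py: `submit_with_context`, `request_id`, and `read_request_id` are hypothetical names, and snapshotting the whole context with `copy_context()` is an assumption about one way to do it.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical ContextVar standing in for the BuildGrid ones used by logging.
request_id = contextvars.ContextVar("request_id", default="unset")


def submit_with_context(pool, fn, *args, **kwargs):
    """Submit fn to the pool so it runs with the caller's ContextVar values."""
    # Snapshot the submitting thread's context at submit time.
    ctx = contextvars.copy_context()
    # Context.run executes fn inside that snapshot on the worker thread.
    return pool.submit(ctx.run, fn, *args, **kwargs)


def read_request_id():
    # Stand-in for a log call that reads request metadata from the context.
    return request_id.get()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        request_id.set("req-42")
        future = submit_with_context(pool, read_request_id)
        print(future.result())
```

Each submission takes its own snapshot, so values set by one request are not visible to tasks submitted on behalf of another.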
