Move to a threading scheduler
Summary
This is a meta issue to track the work to move towards a threading scheduler.
Rationale
Historically, we had a threaded scheduler. We moved to a multiprocess one to speed up BuildStream, which at the time was doing a lot of work in the main Python process.
BuildStream has evolved a lot since then: many of the long-running tasks have been delegated to other tools (buildbox-casd, buildbox-run), so BuildStream core now runs fewer long-running operations.
It is therefore less important now to be able to run the core on multiple CPUs at the same time.
We are at a point where multiprocessing is more of a burden than a help, and we should move back to a threaded scheduler.
We should get the following benefits out of it:
- Easier to debug plugins and scheduler code: since everything will be in the same process, pdb and similar tools should have an easier time.
- Easier error handling and reporting, since we will have a more visible trace if a thread goes awry.
- Less fork() overhead, since forking in Python is expensive (noticeable even in benchmarks).
- Easier Windows/macOS support (for BuildStream only; buildbox-* would still need other things), since we would no longer need to rely on fork(): Windows doesn't support it, and macOS's default is to not use it, as it can cause problems.
- We will stop using Python in an unsupported way, since using asyncio together with forking is explicitly not supported and has many problems.
This will however bring some disadvantages:
- More pressure on the ioloop
- We need to keep long running, blocking operations to a minimum in the scheduler
- Potential concurrency bugs, now that we share the same address space
- Single CPU usage, no more easy multi-cpu work.
I believe the benefits outweigh the problems, and we should therefore move.
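To illustrate the first two disadvantages above, here is a minimal sketch of how blocking work can be kept off the event loop by handing it to a thread pool with asyncio's run_in_executor. It is not BuildStream code: blocking_job, heartbeat and run_jobs are hypothetical names, and the sleep stands in for whatever long-running operation a job would perform.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_job(name: str, duration: float) -> str:
    # Hypothetical stand-in for a long-running, blocking operation.
    time.sleep(duration)
    return f"{name} done"


async def heartbeat(ticks: list, interval: float, stop: asyncio.Event) -> None:
    # Records a tick per interval; it can only run while the loop is not blocked.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def run_jobs() -> tuple:
    loop = asyncio.get_running_loop()
    ticks = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, 0.01, stop))

    with ThreadPoolExecutor(max_workers=2) as pool:
        # The blocking calls run in worker threads; the loop only awaits them.
        results = await asyncio.gather(
            loop.run_in_executor(pool, blocking_job, "fetch", 0.1),
            loop.run_in_executor(pool, blocking_job, "build", 0.1),
        )

    stop.set()
    await beat
    return results, len(ticks)
```

Calling asyncio.run(run_jobs()) returns the job results together with the number of heartbeat ticks. If blocking_job were called directly inside a coroutine instead, the heartbeat would stall for the whole duration of the job, which is the kind of pressure on the loop the list above warns about.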
Work to do
- Move BuildStream to a threaded scheduler, without too many invasive changes (!1982 (merged))
- Remove the parent-child 'Job' separation, and stop sending messages across when we can avoid it
- Rewrite the messenger, to allow specifying a 'job' messenger, to allow a better integration with the ioloop
- Integrate the CasUsageMonitor into the IOLoop instead of having it as a separate thread
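As a rough idea of what the last item could look like, the sketch below runs a periodic monitor as an asyncio task on the loop instead of in its own thread. This is an assumption about the general shape only: UsageMonitor, its query callable and the polling interval are hypothetical and do not reflect the actual CasUsageMonitor implementation.

```python
import asyncio


class UsageMonitor:
    """Hypothetical stand-in for a periodic usage monitor."""

    def __init__(self, query, interval: float) -> None:
        self._query = query        # callable returning the current usage
        self._interval = interval  # seconds between two polls
        self.samples = []
        self._task = None

    def start(self) -> None:
        # Must be called from within a running event loop.
        self._task = asyncio.get_running_loop().create_task(self._poll())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

    async def _poll(self) -> None:
        # Runs on the loop: no extra thread, no locking around samples.
        while True:
            self.samples.append(self._query())
            await asyncio.sleep(self._interval)


async def demo() -> list:
    counter = iter(range(1000))
    monitor = UsageMonitor(lambda: next(counter), interval=0.01)
    monitor.start()
    await asyncio.sleep(0.05)
    await monitor.stop()
    return monitor.samples
```

This only works well if the query itself is cheap. If it blocks, it would need to be made asynchronous or pushed to an executor, otherwise it adds to the pressure on the loop.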
Potential optimisations / areas to look into
- Using a per-queue pool of threads for finer handling
- Remove need for child watchers, and try alternative loop implementation (uvloop?)
- Move more of the pre-public API code to be async, for better integration
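For the per-queue pool idea, one possible shape is a small wrapper holding one ThreadPoolExecutor per queue, each with its own worker limit. This is only a sketch: QueuePools, the queue names and the limits are hypothetical and chosen for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class QueuePools:
    """Hypothetical: one thread pool per scheduler queue."""

    def __init__(self, limits: dict) -> None:
        # limits maps a queue name to its maximum number of worker threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=n, thread_name_prefix=name)
            for name, n in limits.items()
        }

    def submit(self, queue: str, fn, *args):
        # Returns a concurrent.futures.Future for the submitted job.
        return self._pools[queue].submit(fn, *args)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)


def current_thread_name(_element: str) -> str:
    # Illustrative job: reports which pool's thread it ran on.
    return threading.current_thread().name
```

With pools = QueuePools({"fetch": 4, "build": 2}), a call such as pools.submit("build", current_thread_name, "hello.bst") runs on a thread whose name starts with "build", so each queue's concurrency can be tuned and observed independently.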