
Run UI in a separate process from the scheduler.

Background

In the last round of profiling data collected by @danielsilverstone-ct, we could see that UI rendering is taking up a significant proportion of the CPU time in BuildStream's main process.

Further investigation has shown that UI rendering is a performance bottleneck. This is particularly obvious when I/O is slow and writing to the screen is expensive: @BenjaminSchubert observed on his system (using BuildStream in Docker through bash.exe on a Windows machine) that 40% of the time was spent writing to the screen, with the main process pegged at 100% of a CPU. Having the UI and the scheduler run in separate processes would remove this bottleneck.

One possible approach is to have the UI remain in the main process, with scheduling running in a subprocess. The scheduling process will be managed and abstracted by the Stream, without being known to the front-end. As part of this work, message handling will be refactored so that all state required by the front-end is passed explicitly with each Message object, where the front-end currently relies on some global state. All signals and user input will be handled by the front-end, which will tell the stream what actions to take.
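
As a rough illustration of that message refactor, here is a minimal sketch of a message object that carries all of its state explicitly; the field names here are assumptions for illustration, not BuildStream's actual Message attributes:

```python
from dataclasses import dataclass, field
import datetime

# Hypothetical, simplified Message: everything the front-end needs to
# render it travels with the object, so the receiving process never has
# to consult shared or global state.
@dataclass(frozen=True)
class Message:
    message_type: str            # e.g. "start", "success", "failure"
    text: str
    element_name: str            # resolved in the sending process
    action_name: str = ""
    detail: str = ""
    creation_time: datetime.datetime = field(
        default_factory=datetime.datetime.now
    )
```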

The details of how interactive shells will work have not yet been ironed out, though various approaches have been discussed on the list. This will need some experimentation.

This was discussed in this ML thread. Whilst there are still details to iron out, we're at a point where it makes sense to put together a proof of concept to verify that the proposed approach will work and that we can actually achieve some performance improvements.

Task description

The current task proposal is below. Please note that points 1-4 exist in various states in the branch phil/ui-split-refactor - this branch contains the PoC, and our plan is to merge the points below as small, digestible MRs into master, rebasing our branch as we go.

  • Rework the LoadError exception to bring its arg ordering in line with the rest of the exception classes. Subclassing exceptions causes problems under pickle, especially when a given positional arg is named message. Rather than hacking around this, a single MR should align the ordering throughout the codebase (see the pickle sketch after this list). Some info around this is here and the work is captured in !1489 (merged)

  • MR to add the changes for message/job construction. This MR will include changes to the Message class itself and how messages are generated. For instance, an ElementJob will use its element_full_name as the unique_id, avoiding a full Plugin table lookup in the main process just to recover the same string (this isn't just wasteful, it also won't work once state is concurrent across processes). It should also make the messages more 'correct', in the sense that the data is encoded when the message is generated rather than later in the process. This lives at !1500 (merged)

  • MR to remove the need for anything in the frontend (App) to hold an Element or Queue instance: the frontend operates on just the name, and the 'backend' resolves it against the corresponding object. This might require a generic implementation that can then be reused by the bidirectional notification handler. This lives at !1546 (merged)

  • Add the basis of a 'NotificationHandler' between the stream and the scheduler. An MR can be created for the initial handler implementation, which abstracts the callbacks from the frontend (job start, complete, interrupt, tick) to the scheduler via the stream. This should be a multiprocessing queue, reused for further subprocess-to-frontend notifications that need to reach the main process (exception handling, queue/task group state, task errors etc.). It needs to be bi-directional so that the app can notify the 'backend' which objects (such as Elements, which will exist in different states in each process) to act on; a rough sketch follows this list. This lives at !1550 (merged)

  • PoC/MVP for the multiprocessing implementation. This was originally intended to take the form of a context manager covering the entry points into the stream, which go on to invoke a scheduler (update: it doesn't seem plausible to support pickling of subprocessed methods under a context manager, so this approach will be different). At which exact point the process split happens is not yet final, as having certain things loaded in the frontend may prove useful, for example when handling an interactive build-failure shell. Initially this will be applied to build as the most prominent use case, ensuring the basis of the architectural changes is benchmarked and the implementation is reviewed before extending it to further methods. This MR will include the changes needed to support concurrency beyond the existing concurrency (the scheduler jobs are already multiprocessed, meaning there's extra complexity in ensuring state is mirrored in the new 'main' process). There may also need to be work on the handling of status rendering for Tasks across State/Messenger, which did not exist at the time of the original implementation. This lives at !1613 (closed)

  • Further work on the separation of processes. The queue into the subprocess (specifically the scheduler) needs to be handled in a bi-directional manner, most likely by having two queues, one for each direction. The frontend app currently initiates the scheduler's termination of its asyncio processes; this should instead be handled via the queue into the subprocessed operation. Once the scheduler is fully abstracted into a subprocess, signal handling should move into the main process, as we will be able to run another asyncio loop there (see the sketch after this list). With this in place, the 'frontend/main process' will instruct the scheduler how to behave when such signals are caught, inverting the current control flow in which the scheduler waits for the frontend to respond with what it should do.

  • Once there's an async event loop on both sides of the notification handler (stream & scheduler), benchmarks & profiling should be run to compare non-interactive bst builds (where signal handling is not needed) with & without the multiprocess implementation. These should be conducted on both Linux & WSL (WSL is where click.echo showed the most overhead in profiling). At this point it should be decided whether the added code complexity and maintenance cost (especially if landed in an opt-in/experimental state) is worth the originally perceived benefits of the process separation.

  • Handling 'on-the-fly' build failures in interactive mode, where the user indicates that they wish to load a build shell into the sandbox. There is still no concrete plan for how to handle this (as seen on the mailing list thread). (This may now come under the 3rd bullet point, to be included with the notification handling. As it stands, the initial implementation may require the selection to be loaded via the element name, as is the case when loading a new build shell.)
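
To illustrate the pickle problem behind the first bullet, here is a minimal, self-contained sketch (the signature is hypothetical, not BuildStream's actual LoadError) showing why an exception subclass with a reordered __init__ fails to round-trip through pickle. pickle re-creates an exception by calling cls(*exc.args), and exc.args only holds what was forwarded to Exception.__init__:

```python
import pickle

class LoadError(Exception):
    # Hypothetical signature for illustration: 'reason' comes before
    # 'message', and only 'message' reaches Exception.__init__, so
    # exc.args == (message,).
    def __init__(self, reason, message):
        super().__init__(message)
        self.reason = reason

err = LoadError("missing-file", "could not load element.bst")

try:
    pickle.loads(pickle.dumps(err))
except TypeError as e:
    # Unpickling calls LoadError("could not load element.bst"),
    # which is missing the second positional argument.
    print(e)
```

Aligning the arg ordering (and forwarding all positional args to the base class) avoids this, which matters once errors start crossing process boundaries.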
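A rough sketch of the bi-directional NotificationHandler idea from the fourth bullet, assuming one multiprocessing queue per direction; the class and its fields are illustrative, not the final API:

```python
import multiprocessing
from enum import Enum

class NotificationType(Enum):
    JOB_START = "job_start"
    JOB_COMPLETE = "job_complete"
    INTERRUPT = "interrupt"
    TICK = "tick"

class Notification:
    def __init__(self, notification_type, **payload):
        self.notification_type = notification_type
        self.payload = payload  # plain, picklable data only

class NotificationHandler:
    # One queue per direction: the scheduler posts job/state events to
    # the frontend, and the frontend posts instructions (e.g. which
    # element name to act on) back to the scheduler.
    def __init__(self):
        self.to_frontend = multiprocessing.Queue()
        self.to_scheduler = multiprocessing.Queue()

    def notify_frontend(self, notification):
        self.to_frontend.put(notification)

    def notify_scheduler(self, notification):
        self.to_scheduler.put(notification)

    def poll_frontend(self):
        # Drain without blocking so the UI render loop stays responsive
        # (empty()/get_nowait() is racy but harmless here: anything
        # missed is picked up on the next poll).
        while not self.to_frontend.empty():
            yield self.to_frontend.get_nowait()
```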
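And a sketch of the signal-handling inversion from the 'further work' bullet: once the scheduler lives in a subprocess, the main process can run its own asyncio loop, catch signals there, and instruct the scheduler over the queue. asyncio's add_signal_handler is Unix-only; everything apart from the asyncio/signal APIs is an assumed name:

```python
import asyncio
import signal

def install_signal_handlers(loop, handler):
    # 'handler' is the NotificationHandler sketched above. The frontend
    # decides the policy (prompt the user, suspend, quit, ...) and then
    # tells the scheduler what to do, rather than the scheduler pausing
    # and waiting on the frontend as it does today.
    def on_sigint():
        handler.notify_scheduler(("interrupt", None))

    def on_sigterm():
        handler.notify_scheduler(("terminate", None))

    loop.add_signal_handler(signal.SIGINT, on_sigint)
    loop.add_signal_handler(signal.SIGTERM, on_sigterm)
```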

Acceptance Criteria

