Message-based concurrency (and parallelism)

I doubt many truly, entirely single-threaded programs are written any more, if only because of the need for asynchronous, non-blocking operations such as network requests.


JavaScript’s “single-threaded” execution is not data race safe

Even JavaScript, which runs all source code in a single thread, fails to ensure data race safety because it is reentrant. There’s shared access to mutable data between (what the programmer has reasoned to be sequential) processes1 (aka ‘tasks’) that execute partially before being interrupted to allow another task in the same source code to execute. This is a form of data race unsafety when those concurrent tasks share access to mutable data: although they never run on the CPU simultaneously, they are logically concurrent because they don’t always run to completion before another of the tasks executes. For example, an algorithm may presume some mutable data won’t change until the algorithm completes, yet the algorithm sleeps (in a callback) while waiting for a non-blocking2, asynchronous operation to complete, and another event handler or callback may run in the interim and modify the shared mutable data.
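
To make the hazard concrete, here is a minimal TypeScript sketch (the names and amounts are hypothetical): two logically concurrent tasks share mutable state across an `await`, so the invariant checked before the asynchronous operation no longer holds afterward.

```typescript
let balance = 100; // mutable data shared by logically concurrent tasks

// Stand-in for any non-blocking, asynchronous operation (network request, timer, etc.).
function authorize(amount: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, 10));
}

async function withdraw(amount: number): Promise<void> {
  if (balance >= amount) {     // the check...
    await authorize(amount);   // ...the task sleeps here while another task may run...
    balance -= amount;         // ...then acts on a now possibly stale check
  }
}

// Both checks pass before either task resumes, so the shared balance goes negative.
withdraw(80);
withdraw(80);
setTimeout(() => console.log(balance), 50); // prints -60, not the expected 20
```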

JavaScript can context-switch between tasks at a non-blocking, asynchronous operation by supplying a callback to the operation and returning up the call hierarchy to the event queue dispatcher (in the JavaScript runtime), which calls the registered event handler function for the next event in the queue. The completion of an asynchronous operation schedules the dispatcher to wake up the callback3 by adding an “execute callback” event to the queue.

In modern JavaScript, these callbacks are wrapped in the Promise abstraction and perhaps even in the async/await code transformation4, which creates the required callbacks from what appears to be synchronous, sequential (i.e. not multithreaded reentrant) source code.
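
As a rough illustration of that transformation (a conceptual sketch, not the engine’s actual implementation, which footnote 4 notes may be compiled to a state machine), the async/await form below is approximately equivalent to the explicit Promise-callback form that follows it:

```typescript
// Sequential-looking source code with two context-switch points at the awaits.
async function fetchLength(url: string): Promise<number> {
  const response = await fetch(url);
  const body = await response.text();
  return body.length;
}

// Hand-written callback form: the remainder of the function body after each
// await becomes a closure passed to the Promise's then method.
function fetchLengthDesugared(url: string): Promise<number> {
  return fetch(url)
    .then(response => response.text())
    .then(body => body.length);
}
```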


Data race safety is the fundamental requirement for concurrency

The primary and most fundamental consideration for a concurrency model is not the mechanics of context-switching between tasks that must wait on a high latency asynchronous operation to complete, but rather the paradigm by which data race safety is provably attained. Data race safety means the program has no shared mutable data that is accessible by more than one not-yet-completed task. In addition to the JavaScript case, this invariant can obviously be violated by multiple threads that execute tasks simultaneously on the CPU.

The problem with data races is that they’re non-deterministic (practically unboundedly so) and thus implausible (if not intractable or impossible) to reason about. The program will have unpredictable Heisenbugs.


Locking isn’t provably data race safe (and doesn’t scale)

One means of attempting data race safety is to employ runtime mutexes and other multithreading synchronization primitives, which are essentially a locking paradigm for ordering multithreaded access to mutable state. The problem is there’s usually (due to the Halting problem) no way to prove their use won’t result in a livelock or deadlock error. Thus in essence the data race unsafety still exists. When such primitives are employed liberally they tend to create complex code with lurking, obscure, nearly impossible to debug Heisenbugs.
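
To illustrate, here is a minimal TypeScript sketch (a hand-rolled, promise-based mutex, not any particular library) of a classic lock-ordering deadlock that no compiler check catches: each task acquires the two locks in the opposite order, so each waits forever on the other.

```typescript
// A tiny promise-based mutex: lock() resolves with a release function once acquired.
class Mutex {
  private tail: Promise<void> = Promise.resolve();
  lock(): Promise<() => void> {
    let release!: () => void;
    const willRelease = new Promise<void>(resolve => { release = resolve; });
    const willAcquire = this.tail.then(() => release);
    this.tail = willRelease;
    return willAcquire;
  }
}

const a = new Mutex();
const b = new Mutex();

async function task1() {
  const releaseA = await a.lock();
  await new Promise(r => setTimeout(r, 10)); // yield so task2 can grab b
  const releaseB = await b.lock();           // waits forever: task2 holds b
  releaseB(); releaseA();
}

async function task2() {
  const releaseB = await b.lock();
  await new Promise(r => setTimeout(r, 10)); // yield so task1 can grab a
  const releaseA = await a.lock();           // waits forever: task1 holds a
  releaseA(); releaseB();
}

task1();
task2(); // neither task ever completes, and nothing in the type system flagged it
```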

Per Amdahl’s law, synchronization primitives destroy scalability. Linus Torvalds agrees. The original Go scheduler suffered from global locking.
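
For reference, Amdahl’s law bounds the achievable speedup when a fraction of the work is serialized (e.g. behind a global lock); a quick worked example, assuming a 5% serialized fraction:

```latex
S(n) = \frac{1}{(1 - p) + p/n}, \qquad
S(64)\big\rvert_{p = 0.95} = \frac{1}{0.05 + 0.95/64} \approx 15.4, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 20
```

So even with 64 cores the speedup is roughly 15×, and no number of cores gets past 20×.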

A quote from Sebastian Blessing’s master’s project paper sums it up well:

It is questionable if using threads for concurrency directly within a language is an appropriate solution to these problems. The need for synchronization mechanisms to avoid races makes developing applications fundamentally error prone and unnecessarily complex for the programmer and distract from the actual problem to solve. Additionally, the fact that the provided degree of fine-grained synchronization leads to the opportunity to reach a maximum degree of efficiency and concurrency within a system is disputable.

One could contemplate a transpiler that compiles to JavaScript which would track which mutable JavaScript objects are shared between reentrant code paths (i.e. code paths capable of executing while any of the other code paths sharing the same mutable objects are sleeping on a callback), and then attempt to detect and disallow algorithms which make data race unsafe assumptions. But such algorithm analysis is intractable. And the reentrant code paths would essentially be all code that shares a mutable object, because it would also be intractable, for example, to ascertain which events can fire while other tasks are sleeping in a callback. Alternatively a transpiler could disallow access to any shared mutable data after any asynchronous callback. But that’s an onerous restriction.

Although avoiding locking by sharing only immutable state doesn’t prevent higher-order semantic bugs including deadlocks and livelocks (because mutable state dependencies can be modeled in other ways), there are still multiple benefits gained by ensuring first-level data race safety. An analogous point applies to referential transparency. The theoretical foundation is Russell’s Paradox (c.f. also and also).


Rust’s data race safety model

Another alternative is Rust’s provably data race safe model, which disallows sharing mutable data between any code paths. Rust employs exclusive mutable borrowing coupled with static (i.e. at compile-time) resource lifetime annotations and analysis. We have discussed extensively5 (and even compared it to Pony) that Rust’s model is a whole-program total ordering of exclusive mutability. Rust presumes all the code is reentrant. One benefit of such a total ordering of exclusive mutability is that Rust can, for example, prove that there’s no aliasing of arguments, thus proving that the equivalent of C’s restrict is safe (instead of trusting the programmer to make an error-prone judgement of safety). And Rust can prevent certain semantic programming errors such as iterator invalidation (i.e. mutation of a collection which is also being iterated). Note Pony’s reference capabilities can also provably prevent iterator invalidation. But Rust’s lifetime analysis has false negatives (i.e. it can’t prove that every safe lifetime is actually safe), and rectifying this is presumably intractable. Rust’s libraries are littered with unsafe code that the compiler can’t prove is safe.
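
For readers unfamiliar with the term, here is a minimal TypeScript sketch of the iterator-invalidation class of bug (which TypeScript happily compiles); the analogous Rust code is rejected at compile time because the collection cannot be mutated while it is borrowed by the iterator.

```typescript
const scores = [10, 0, 0, 30];

// Removing elements from the array being indexed over silently skips the element
// that slides into the freed slot (a semantic bug, not a crash).
for (let i = 0; i < scores.length; i++) {
  if (scores[i] === 0) {
    scores.splice(i, 1); // mutates the collection mid-iteration
  }
}

console.log(scores); // prints [10, 0, 30]: the second 0 was skipped, not removed
```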

Rust’s onerous whole-program proofs are a significant tsuris to pay for tasks which are truly sequential (i.e. single-threaded tasks without any interleaved reentrancy). Given truly sequential tasks, the remaining benefits of Rust’s model are the aforementioned low-level restrict optimization, prevention of a minute portion of the multitudes of possible semantic programming errors (Scala and Haskell’s GADTs can even statically check for semantic errors in the use of protocols), and the lifetime tracking for provably safe (i.e. no erroneous use-after-free segfaults) stack allocation, which is more robust than other forms of escape analysis. This provably sound stack allocation and the data race safety for interleaved, reentrant tasks make Rust’s model an ideal fit for some class of high reliability, highest performance, lowest power consumption, highly interactive programs. Yet I’m nearly certain it’s not the best fit for a mainstream general purpose programming language because of the complexity of coding in that model. Also, when the Rust programmer sometimes necessarily punts6 from stack to heap allocation, the automatic memory management (AMM) — whether it be less performant automatic reference counting (ARC) (which also requires tracing to detect and deallocate stranded islands of cyclical references) or tracing garbage collection — doesn’t scale under multithreading because tracing stops the world for all threads.7 Also the multithread synchronization (i.e. locking) required for mutating ARC reference counts in effect removes absolute data race safety (although bugs from such might be rare and obscure).

Rust provides no means to partition data with message-based communication for scaling to massively multicore.


Message-based data race safety

Rust provides data race safety for code that is fully multithreaded. Yet that requisite high level of code complexity is also apparently the only conceivable way it would be possible to prove that JavaScript’s (single-threaded) event queue reentrancy is data race safe in all its forms. So that seems to imply that JavaScript’s concurrency model is probably an incorrect design choice, because it isn’t message-based and non-reentrant.

Setting aside another paradigm of non-partitioned sharing, namely enforcing ubiquitous immutability with copying to emulate mutable state (such as Haskell’s and PureScript’s continuation monad or Clojure’s persistent data structures), the alternative to Rust’s model is a message-based partitioning of concurrency wherein each partition can run in a separate OS thread (i.e. simultaneously on the CPU if there are enough cores). The partitioning of exclusive mutability for sharing data is demarcated by the message-based communication boundary that separates communicating sequential processes (CSP). The distinction from Rust is that each message-based-communication bounded partition runs a single, truly sequential task. Thus there’s no requirement to prove exclusive mutability for the code of each task. The exclusive mutability only has to be proven for mutable data that is accessible (i.e. shared) by more than one such partition. The sequential task can sleep waiting on an asynchronous operation, but there’s no reentrancy because any context-switch of the OS thread that was formerly executing the sleeping task will only switch to a task in a different partition. Non-blocking, asynchronous operations are handled by sending and receiving messages. Incoming message event processing is sequential instead of reentrant. The single task for each partition is not reentrant, thus it processes the next incoming message only after it completes processing of the prior message.
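
A rough sketch of such a partition using Node.js worker_threads from TypeScript (assuming the file is compiled to CommonJS JavaScript before being run, so __filename is available): the worker owns its mutable state exclusively, processes its mailbox one message at a time with a synchronous handler (so there’s no reentrancy), and messages crossing the boundary are copied (structured-cloned) rather than shared.

```typescript
import { Worker, isMainThread, parentPort } from "worker_threads";

if (isMainThread) {
  // The main partition spawns another partition, which runs in its own OS thread.
  const counter = new Worker(__filename);
  counter.on("message", (total: number) => console.log("running total:", total));
  counter.postMessage(5); // message contents are copied across the boundary
  counter.postMessage(7);
} else {
  // This mutable state is visible only inside this partition.
  let total = 0;
  parentPort!.on("message", (n: number) => {
    total += n; // one message is fully processed before the next is handled
    parentPort!.postMessage(total);
  });
}
```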

Pony’s model and the Actor model

Pony has reference capability types (c.f. also) for statically (i.e. at compile-time) tracking data sharing, to provably ensure that inter-partition sharing of mutable data is exclusive. Intra-partition sharing of data (which doesn’t need exclusive mutability because the single task of the partition is single-threaded and not reentrant) is also tracked, but only because it’s required for the tracking of said inter-partition sharing.

A quote from George Steed’s master’s project paper sums it up well:

Pony is an actor-model, concurrent programming language developed with the aim of being able to write high performance and data-race free programs that naturally take advantage of the multiple cores present in modern computers without exposing the user to complex and difficult to diagnose issues such as synchronisation correctness. Utilising the actor model is a natural choice for concurrency: actors themselves are naturally independent and can execute in isolation, only needing to communicate when passing messages to one another. In order to support passing objects to other actors without the need to copy them each time, which would be a substantial blow to performance, Pony utilises a system of capabilities to ensure that data-races cannot occur due to shared objects.

Pony advertises that it employs the Actor model, but that is a misnomer which has apparently been adopted industry-wide (e.g. Akka’s “actors”). The garbage collection algorithms8 that Pony employs, both for the collection of “actor” objects (i.e. the partitions) and for inter-partition shared objects, require that message delivery be causal (i.e. partially but not totally ordered, c.f. §2.3 Causality in Distributed Systems). Whereas the Actor model, which also has message-based bounded partitions named actors with a sequential (single) task for each actor, differs from the CSP model in that messages aren’t guaranteed to be ordered, nor even to arrive. Where Pony refers to “indeterminate time”, they mean the behavior call is added to a message queue, but delivery is still causally ordered. Although both the Actor and CSP models can simulate some features of each other, they fundamentally differ in that CSP’s communication is deterministic, i.e. it exhibits only bounded “non-determinism”. CSP is intended for guaranteed communication channels such as multicore hardware (c.f. §Implementation considerations). Whereas per the oft-cited, fundamental FLP theorem (c.f. §2.4 Failure Detection in Asynchronous Networks), the Actor model requires stochastic modeling of inconsistency, with only probabilistic eventual consistency fault tolerance, for unbounded non-deterministic9 asynchronous communication channels such as over the Internet.

Sharing doesn’t scale

Pony’s ORCA garbage collection algorithm for inter-partition shared objects is inefficient because of the conceptual, generative essence of Amdahl’s law, in this case that “multiple garbage collectors running in parallel with each other and with the running program, can add significant … synchronization overhead … reducing the advantages of parallelism”. They even advise reducing sharing. I previously posited that sharing of data will not scale nearly as well to massively multicore as a modicum of message-based copying of data between cores. Partitioned message-based communication delineates data sharing, and thus even delineated sharing (instead of copying) can in theory scale slightly better than the hardware cache coherency and stop-the-world garbage collection that non-partitioned paradigms (e.g. Rust, Java, continuation monads, etc.) require.

It’s ironic that the claimed superiority of Haskell is vacated by massively multicore scaling. Continuation monads, persistent data structures (or any form of non-partitioned immutability) will not scale as well to massively multicore as (e.g. Pony’s) data race safe sharing between message-based partitions (for a hypothetical software cache coherency MIMD shared memory). And not sharing at all between cores is posited to scale even better for the distributed memory variant of MIMD.

The alternative to sharing data structures is copying a modicum of data that is sent in messages. Some programs or portions of programs can be structured to not employ inter-partition sharing of large data structures. Avoiding sharing eliminates one form of parallelism. In cases where sharing is unavoidable, Pony’s reference capabilities and ORCA garbage collection model is probably (near to) optimal.

Note the clever and scalable algorithm8 that Pony employs to garbage collect the CSP-like “actors” would still be required even if inter-“actor” sharing is avoided to maximize scaling.


Asynchronous blocking operations in green threads

The runtime scheduler10 for managed M:N green threads11 such as Go’s goroutines may context-switch the OS thread to a different task at any usermode cooperative (i.e. non-preemptive) opportunity. As previously mentioned, for data race safety there’s only one task in each message-based-communication bounded, non-reentrant partition (such as Go’s CSP partitions named goroutines, but not C++’s boost.fiber). Thus a blocked OS thread that can be context-switched will be context-switched to the task in a different goroutine.

There’s obviously no usermode opportunity to context-switch an OS thread that is blocked inside12 a system call. Only the OS preemptively context-switches CPU processors (i.e. cores or hyperthreads) between OS threads, after each scheduling quantum (on the order of 10 ms, i.e. a roughly 100 Hz tick) or for a hardware interrupt priority. Note (when there’s M:N threading) the distinction that OS threads are non-preemptively context-switched between tasks by usermode code. Whereas processors are preemptively context-switched between OS threads by the OS.

Task-initiated blocking operations block the initiating task but not always the associated OS thread. Blocking operations that don’t block inside a system call (e.g. a goroutine waiting on input from its CSP channel message queue, or “network operations”) should cause a usermode, non-preemptive context-switch of the OS thread to an unblocked (i.e. runnable) task. This avoids the slow blocking and unblocking of the OS thread (by a system call initiated by the scheduler) or the overhead of idle spinning the OS thread. The OS will prioritize the allocation of processor time to those OS threads which aren’t repetitively yielding to the OS (a form of idle spinning) and aren’t blocked in a system call. Yet there’s a tradeoff to context-switching the processor in a blocking operation, because idle spinning the processor instead can improve latency (to a degree relative to the duration of the operation) at the cost of CPU-utilization-dependent throughput. Some ultra-low-latency designs (usually employing custom hardware) which prioritize idle spinning the processor will not rely on the OS’s default behavior, but this isn’t suitable in the general purpose case where latency variance is high.

Go’s scheduler10 spawns more than one OS thread per processor to compensate for OS threads blocked inside system calls, so that no processor is starved of usermode tasks (i.e. goroutines) to run. Pony doesn’t spawn additional OS threads because it doesn’t have stackful green threads, thus can’t enable the OS to preemptively context-switch blocking operations to another “actor”, and is only able to usermode context-switch on message queues.

Spawning goroutines posited to be as harmful as goto

Since goroutines make the context-switch opaque in the source code, the programmer of the source code for the calling (aka parent) task is given no explicit indication that a blocking operation may have appreciable latency and is capable of executing asynchronously. Thus the programmer may not know there’s an opportunity to (nor have a means to) explicitly express safe opportunities for parallelism in a sequential control flow coding style (i.e. without needing to use explicit goroutines and channels, which are cumbersome and by default break control flow dependencies such as error handling13, although abstracting message queues as functions that optionally return a Promise could enable modeling them as blocking functions). The desired parallelism means concurrently executing any mutually data race safe and non-dependent code that follows the blocking operation in the source code of the calling task, including any additional asynchronously capable blocking operations. It may be imprudent and intractable (c.f. also) for the compiler to autonomously analyze the code to discover optimal and data race safe opportunities for mutually non-dependent concurrency. This is especially true given the implausibility of avoiding ambient and semantic ordering races not expressed explicitly in the code, although no data race safety model can provably enforce what is not expressed to the compiler. Whereas, non-blocking operations make the context-switches explicit in the source code.

One could contemplate an alternative green threading design which would explicitly indicate in the source code (analogous to JavaScript’s Promise.all) which blocking operations are capable of safely running asynchronously, so the programmer could explicitly indicate (optional at the runtime scheduler’s discretion) opportunities for parallelism. Even the abstraction of message-based channels as iterable streams could be incorporated. The compiler would need to check (perhaps via Pony’s reference capabilities?) what it can to ensure data race safety. Perhaps in some cases optimizations could avoid always allocating, for each of these locally concurrent tasks, the separate stack required for each goroutine. For example it may be more efficient to execute the high latency network blocking operations on another goroutine (amortizing the extra resource cost over the high latency, as compared to sequential execution) and run the low latency locally concurrent operation on the current goroutine. It may also be worthwhile to avoid the extra stack allocations by executing those high latency network operations with heap-allocated callbacks awoken by a JavaScript-like event queue instead of the netpoller waking up goroutines. Although perhaps being optimistic about required stack size, and allocating smaller growable stacks in only this case, would be as performant and reasonably heap efficient.
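
A hedged TypeScript sketch of what such explicit expression looks like today with Promise.all (the helper names are hypothetical): the two operations are mutually non-dependent and data race safe, so the programmer marks them as eligible to run concurrently while the calling task’s control flow, including error propagation, stays sequential.

```typescript
// Hypothetical stand-ins for high latency, asynchronously capable blocking operations.
declare function fetchProfile(userId: string): Promise<object>;
declare function fetchHistory(userId: string): Promise<object[]>;

async function loadDashboard(userId: string) {
  // Sequential style: the second request doesn't even start until the first completes.
  //   const profile = await fetchProfile(userId);
  //   const history = await fetchHistory(userId);

  // Explicitly concurrent: both requests are in flight while the caller awaits them
  // together, and a rejection from either propagates to this calling task.
  const [profile, history] = await Promise.all([
    fetchProfile(userId),
    fetchHistory(userId),
  ]);
  return { profile, history };
}
```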

Green threads vs. CPS or callbacks

Compared to non-blocking (callback-based or continuation passing style aka CPS) cooperative multitasking coroutines, stackful green threads execute more efficiently, can have more cooperative context-switch opportunities (because they’re not required to be explicit) for avoiding problematic latency, and don’t break stack traces (nor break exceptions, which are broken in non-blocking cooperative multitasking if they’re not correctly transformed4 by a transpiler or in the compilation of async/await). Blocking green threads in a single-task, message-based-communication bounded partitioning paradigm don’t incur the data race unsafety issue with “schedule points” in the willy-nilly “cancel and schedule points” concerns otherwise required for concurrency. And in every correctly implemented concurrency paradigm, the “cancel points” should always propagate an exception to the code path that is waiting on the asynchronous operation (or event) instead of returning it to the ”cancel points”.
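
A minimal TypeScript sketch of that last point, using the standard AbortController API: the cancellation surfaces as an exception at the code path awaiting the asynchronous operation, not as a value handed back to the code that triggered the cancel.

```typescript
async function loadWithTimeout(url: string): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 1000); // the "cancel point"
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.text();
  } catch (err) {
    // An aborted fetch rejects, so cancellation arrives here, in the waiting task,
    // as an exception rather than being returned to the cancelling code.
    console.error("request cancelled or failed:", err);
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```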

Both green threads and CPS are more efficient to context-switch than callback-based approaches (because they don’t have to unwind the stack with heap allocation). Although stackful green threads must allocate an entire (even if growable) stack for each concurrent task; whereas, there’s at best only one stack per processor (i.e. core or hyperthread) for CPS and callback-based approaches. In general the overhead of usermode cooperative multitasking is greater than14 for preemptive kernel OS thread context-switching when the latter’s more heavyweight context11 (thus less efficient context-switch) isn’t the main cost factor. When there are significantly more tasks than OS threads, then the heavyweight kernel context-switch cost dominates, thus M:N threading (which is normally an attribute of usermode cooperative multitasking) can be more efficient and scale better. Although green thread context-switches (when not at function calls11) aren’t as14 efficient as stackless coroutines, compared to stackful green threads all other usermode cooperative multitasking alternatives (not just those which are entirely stackless) require the performance cost of significant additional heap allocation. To dynamically grow the size of stacks, stacks must be movable to avoid a pathological corner case with split (aka segmented) stacks. Movable stacks require special handling and/or restrictions on references pointing into the stack, unless perhaps mmap is employed, yet mmap isn’t optimal.


  1. The term ‘process’ in this document means a usermode task and shouldn’t be confused with an OS process. In single-threaded JavaScript, a task begins as an event callback or from the root of execution. Note reentrancy means that even the same event callback handler could be interrupted and called again before its prior invocation has completed. ↩

  2. The term non-blocking in this context can be misleading because the calling task may sleep (i.e. “block”) until the asynchronous operation completes. The distinction is that a blocking operation doesn’t return to the caller until the operation has completed. Whereas, a non-blocking operation returns immediately to the calling task and enables the task to save its state and, via cooperative multitasking, then “context-switch” to another task if the calling task must sleep until the non-blocking operation completes (for example as described for JavaScript). The non-blocking operation inputs a callback for returning the completion of its operation. The “context-switch” is not by the switching of stacks as is the case for OS threads or stackful green threads, but by storing the callback closure context on the heap.

    Additionally if there’s an opportunity for safe parallelism, the execution may continue in the calling task and/or spawn other non-blocking operations asynchronously if all these asynchronous code paths are data race safe and not mutually dependent on their respective completion. This means that asynchronous operations (whether they be non-blocking or blocking) shouldn’t input mutable references to the same data. This restriction on sharing mutable data applies of course also to asynchronous operations which input callbacks that are closures and reference instead of copy the data in the lexical context. ↩

  3. The callback must be scheduled to execute round-robin, thus it can’t be called directly by another OS thread which may be executing the asynchronous operation, given that JavaScript ensures single-threaded execution of events. ↩

  4. Conceptually the await keyword transforms the source code such that the remainder of the function body is inserted into a callback, which is a closure over the lexical scope and thus captures access to the data accessible in that scope. The callback is an input to the then method of the Promise returned by the non-blocking operation at the right-hand-side (RHS) operand of (i.e. which follows) the await (prefix unary) expression. The async keyword wraps the return value of the function in a Promise. Note the actual implementation may be optimized as a “class finite state machine”.13

    If the await is contained inside a loop, the code transformation should move the portion of the loop body above the await (including any conditional test at the top of the loop) to the end of the loop body. Note I have posited that await should also transform try-catch-finally scopes which enclose the await, but I have not yet verified whether JavaScript (and other implementations such as C# and a transpiler for Java) are doing this or the loop transformations. Analogous transformations are required if await is contained in a switch-case. ↩ ↩2

  5. Scattered throughout the lengthy Concurrency and WD-40 threads. Note in the web browser at those linked Github Issues threads, use Ctrl+F to search for “Load More…” to load all the posts before searching for all instances of “Rust”. ↩

  6. For numerous possible circumstances including that some programs can’t be reasonably coaxed entirely into provably safe stack allocation. ↩

  7. Incremental tracing in the presence of multithreading amortizes operations to reduce latency (compared to non-incremental tracing) at the cost of increased overhead, but doesn’t ameliorate the fundamental problem that multithreaded tracing GC doesn’t scale to massively multicore. IOW, as the number of cores (i.e. simultaneously running threads) increases, the multithreading-compatible incremental tracing collector consumes (due to Amdahl’s law) an ever greater percentage of the processor resources. ↩

  8. Instead of periodically tracing the entire “actor” references graph, which wouldn’t be scalable because it stops the world of all “actor” partitions, in this clever garbage collector “actors” maintain their incoming references count (i.e. the count of other “actors” which are referencing said “actor”) and a list of all outgoing references (i.e. “actors” which said “actor” references). When an “actor”’s message queue remains idle for some period of time, aka quiescence (i.e. potentially dead and unused), then it sends a message to the special “actor” which will detect dead (i.e. stranded islands of) cycles in the “actor” references graph. This special “actor” employs a confirmation acknowledgement protocol (for only the subset of involved “actors” instead of the entire “actor” references graph) to distinguish true cycles from ephemeral delays in inter-“actor” synchronization of message state.

    A limited form of §Weighted reference counting is employed to optimize the implementation. Note §Deferred Reference Counting means that references from each “actor” partition (to any data, including but not limited to “actors”) are tracked with a distinct periodic tracing garbage collector (but apparently not the generational variant) for each “actor”. If said tracing collector finds any number of references to an external “actor”, those count as a single reference for the novel inter-partition “actor” reference graph garbage collection. And note the conceptually separate ORCA garbage collection works analogously for data shared between “actors”.

    Unlike non-partitioned paradigms (e.g. Rust, Java, etc.) which have a single tracing garbage collector for the entire program (for the heap allocated data) that can’t scale to multicore, has to trade off latency for throughput overhead, and typically isn’t suitable for ultra low latency use cases, this plurality of tracing garbage collectors (one for each “actor”) doesn’t interfere with scaling because only the single-threaded partition is paused while tracing, not the entire world of all partitions. At any given time only a fraction of the ”actors” will be stopped by tracing garbage collection, so there’s no global latency pause. ↩ ↩2

  9. The creator of the Actor model refers to unbounded non-determinism as “indeterminacy” (where the unbounded entropy of the universe is intertwined in the state machine of the program, for example by entering the bounded nondeterminism of the static program via I/O). ↩

  10. The scheduler in M:N threading has fewer processors than (data race safe) tasks available to execute simultaneously on the processors. So the scheduler attempts to minimize priority inversion, contention and the starvation of all involved resources including the processors, network, and local persistent storage. Asynchronous resource coordination is a complex balancing act (c.f. §2.5 Scheduling of Tasks), not solely the responsibility of the scheduler. For example in some designs even issues with buffering can cause buffer bloat and increased latency.

    Go’s scheduler (c.f. also and also) always has at most, and attempts to have at least, a GOMAXPROCS quantity (which is usually set to the number of the machine’s processors) of OS threads unblocked at all times, in order to minimize slow blocking and unblocking of OS threads. OS threads which are blocked in a system call aren’t associated with any of the P(rocessor) data structures. Unassociated idle (i.e. unblocked) OS threads attempt to associate with an unassociated P(rocessor) data structure until the GOMAXPROCS quantity of P(rocessor) data structures are associated. Associated idle OS threads attempt to steal work (half of a found queue of pending goroutine tasks) from another P, take work from the global queue, or run the netpoller. Notably Go implements sleep with usermode code instead of a system call so that the OS thread isn’t blocked.

    Note Go also allocates in each of the GOMAXPROCS quantity of P(rocessor) data structures a per-P memory allocation cache named mcache, which isn’t germane to our generalized exposition here. Note it’s thread-safe without locking because only one OS thread is unblocked at any given time on each P.

    For blocking operations that don’t involve system calls, timeouts can be less costly in an M:N threading model if fewer OS threads are required, because they’re not spinning idle or blocked during lengthy timeouts. One could also contemplate an M:N threading design tradeoff between the resource requirements for the additional OS threads needed if OS threads aren’t usermode context-switched before calling into blocking system calls (as is sometimes the case for Go), compared with the additional overhead and potential for starvation of employing an OS thread pool (that executes the system calls) and a message queue of such pending operations (which is essentially what Go does for network operations). For the latter design choice perhaps timeouts will also be less costly for the same reason. ↩ ↩2

  11. Green threads are one means of implementing coroutines. Green threads employ a lightweight user mode context instead of heavyweight OS context (c.f. also). They cooperatively (instead of preemptively) context-switch by inserting conditional tests at key points in the program and/or always context-switching on blocking operations that aren’t in a system call. The advantage to context-switching on function calls is that the context may not have to save as many CPU registers. The disadvantage to cooperative context-switching (even if not only on function calls or other key points) is the scheduler may never obtain an opportunity to context-switch, and the task may monopolize the OS thread. ↩ ↩2 ↩3

  12. It may not even be necessary to block in system calls for I/O. That might improve performance. Apparently Windows has better capabilities for non-blocking I/O than Linux. ↩

  13. The conjectured improvement for local concurrency that can be expressed as a sequential control flow, should propagate return values and exceptions from the spawned tasks into the parent (aka calling) task, thus in effect emulating the proposed nursery concept and solving all the cited issues because the parent task reclaims ownership of all shared resources. This wrapping and abstraction of message queues such that they become function calls with optional return values even ameliorates the criticism that deterministic “Actors” (i.e. CSP-style channels) aren’t composable. The indeterminate lifespan issue will remain for spawned CSP tasks because that provides the necessary flexibility to not express all interactions in sequential control flow. Yet also adding the nursery context escape will address additional issues that aren’t addressed by AMM collection. ↩ ↩2

  14. The context of the discussion with direct quotes to the cited portion of the linked document was here, which was linked from the OP of the Concurrency thread discussion. ↩ ↩2

Edited Dec 26, 2018 by Shelby Moore III