Rework Met.3D pipeline architecture: Lifetime safety, determinism (#1212) · Issues · Marc R. / met.3D

Rework Met.3D pipeline architecture: Lifetime safety, determinism

The current Met.3D pipeline architecture couples a mutable pipeline (e.g., `setInputSource()`, prefixes, and other changes that could be made at _any_ time) with asynchronous scheduling. This leads to crashes. The core problem is that the application assumes pipeline states remain unchanged and stable over the lifetime of requests, or even the lifetime of the application.

# Current problems

## Pipeline mutation while requests are pending

An `MDataRequest` does not encode which parent data sources are used. Parent sources are resolved via the mutable `setInputSource()` state. If input sources are changed quickly (e.g., a data variable is changed in an actor, causing some parent data source to change quickly from A to B):

- Task A is created and scheduled after the change to A.
- Task B is created and scheduled after the change to B. The first task is not yet running or is still running, typically waiting for parent sources to finish.
- The first task starts processing and tries to call `parent->getData()`, but the parent is already B instead of A.
- This results in unavailable data items and crashes.

This is also an issue in actors with pipelines that are rebuilt during run time, for example jet cores and fronts. The actor might rebuild the pipeline at any time, calling `deregisterInputSource()` followed by `registerInputSource()`. If another thread executes this pipeline at the same time (building the task graph, producing data items, or just fetching the required keys for the pipeline), that thread might operate on an invalid pipeline state.

## Use-after-free of data sources

Many data sources are actor-owned and deleted dynamically, e.g., when the actor is deleted. Tasks store raw pointers to data sources. If a data source is deleted while tasks are queued or executing, this leads to a crash. In addition, the memory of these data items might leak.

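The task sequence listed under "Pipeline mutation while requests are pending" can be reproduced without any threads. The following is a minimal sketch, not Met.3D code: `Source`, `Consumer`, and `demoStaleParent()` are hypothetical stand-ins, and `Consumer::setInputSource()` only mimics the idea of mutable wiring, not the real signature. A queued task resolves its parent when it runs, not when it was scheduled, so it sees B although it was created for A.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>

// Hypothetical stand-in for a parent data source.
struct Source {
    std::string name;
};

// Hypothetical stand-in for a data source with mutable input wiring.
struct Consumer {
    Source* parent = nullptr;
    void setInputSource(Source* s) { parent = s; }
};

// Schedules a task while the parent is A, re-wires to B, then runs the task.
std::string demoStaleParent()
{
    Source a{"A"};
    Source b{"B"};
    Consumer consumer;
    std::queue<std::function<std::string()>> scheduler;

    consumer.setInputSource(&a);
    // Task created for parent A; it only stores a reference to the consumer.
    scheduler.push([&consumer] { return consumer.parent->name; });

    // The input source changes before the queued task has run.
    consumer.setInputSource(&b);

    // The task resolves the parent at execution time and therefore sees B.
    return scheduler.front()();
}
```

If `b` had instead been deleted before the task ran, the same raw pointer access would be a use-after-free rather than a wrong result.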
## Request is context dependent

Although a request string is _intended_ to uniquely identify a computation, its meaning depends on the current pipeline. A data item produced by a request might be different if one of the input sources of the data source changes, even though the request itself is still valid. This can also lead to the memory manager returning a cached data item for the same data request, although the returned data item would be different from the one that would be produced if the request were recomputed.

## Non-reuse of identical computation results

Some actors (fronts, jet cores, trajectories, etc.) create and own their own data sources (front data source, isosurface intersection source, etc.). Multiple actors of the same type therefore do **not** share these data sources. As a result, two identical jet core actors would both fully compute a data item emitted from the same data request, because they are using two different data source instances.

# Design goals

**Tasks should execute against a stable pipeline definition.** The system should not crash if an input source changes or if a data source in the pipeline is deleted. Furthermore, data sources should be stateless processors. This enables safe reuse of identical computations across actors.

## Option A: Pipeline snapshot

When scheduling a request, capture an immutable snapshot of the pipeline used for that request. Instead of storing only the data request and the data source to request the data from, a task would store the data request (unchanged from the current version) and an `MPipelineContext`. The `MPipelineContext` references the pipeline tree: it contains, as the root node, the initial data source to request from and the data request prefix for that data source. Recursively, it contains child nodes referencing the parent sources together with their respective prefixes. This context is built once at scheduling time and is therefore immutable.

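One possible shape for such a snapshot is sketched below. This is an illustration under assumptions, not a proposed implementation: `DataSource`, `PipelineContext`, `Task`, `makeContext()`, and `demoSnapshot()` are hypothetical names, the request string and prefixes are made up, and shared ownership of the sources is assumed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Met.3D data source.
struct DataSource {
    std::string name;
};

// One node of the pipeline tree: a source, its request prefix, its parents.
struct PipelineContext {
    std::shared_ptr<DataSource> source;  // shared ownership keeps the source alive
    std::string requestPrefix;
    std::vector<std::shared_ptr<const PipelineContext>> parents;
};

// A task stores the unchanged request plus the immutable snapshot.
struct Task {
    std::string request;
    std::shared_ptr<const PipelineContext> context;
};

// Builds a node once; callers only ever receive a pointer-to-const.
std::shared_ptr<const PipelineContext> makeContext(
    std::shared_ptr<DataSource> source, std::string prefix,
    std::vector<std::shared_ptr<const PipelineContext>> parents = {})
{
    auto node = std::make_shared<PipelineContext>();
    node->source = std::move(source);
    node->requestPrefix = std::move(prefix);
    node->parents = std::move(parents);
    return node;
}

// Shows that the snapshot outlives the owner's references to the sources.
std::string demoSnapshot()
{
    auto reader = std::make_shared<DataSource>(DataSource{"readerA"});
    auto filter = std::make_shared<DataSource>(DataSource{"filter"});

    Task task{"VAR=t;LEVEL=500",  // made-up request string
              makeContext(filter, "FILTER_", {makeContext(reader, "READER_")})};

    // The owning actor drops its references while the task is still queued.
    reader.reset();
    filter.reset();

    // The task still sees the pipeline as it was at scheduling time.
    return task.context->source->name + " <- " +
           task.context->parents.front()->source->name;
}
```

In this sketch the sources are released only when the last `Task` holding the context is destroyed, so re-wiring or deleting an actor cannot invalidate a queued task.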
Routing through the pipeline tree is fully defined in the context. Data sources are stored as `shared_ptr`, so when an owner deletes its data source, the data source is not destroyed until the corresponding pipeline context is destroyed as well, which happens after the data request has finished. This would also allow for an interface to abort running tasks.

This option still uses the current pipeline concept with some lingering issues, but makes it safe by snapshotting.

## Option B: Pipeline fully defined in data request

This option requires a central pool of singleton data sources, i.e., there is only one data source instance per type, which is stateless and performs the computation. All data sources have unique identifiers used to reference them.

In Option B, the entire pipeline topology is encoded in the `MDataRequest` itself, while data sources are pure, stateless processors. There is no `setInputSource()`. The scheduler derives dependencies solely from the request, which contains unique identifiers referencing the data source pool.

This requires changes to how data requests are built. Routing would be similar to the existing key prefix mechanism. As a thought experiment, request prefixes for all data sources could be mandatory, making it straightforward to extract sub-requests when calling parent sources. A mapping would be required to determine which prefix is forwarded to which data source ID. Encoding all of this into a single string might be cumbersome; a struct may be a better fit. An actor would then probably build its own pipeline when it is created, so that it can insert the correct keys. That needs some more thought.

Option A feels like a short-term solution, while Option B addresses the fundamental issues, making the correspondence “same data request = same data item” explicit and reliable.
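To make the struct variant of Option B more concrete, here is a rough sketch under assumptions. `Request`, `Pool`, `canonicalKey()`, `evaluate()`, and `demoSharedComputation()` are hypothetical names, data items are reduced to `int`, and the key format is invented for illustration (it would be ambiguous if parameters contained brackets). It requires C++17 because `Request` contains a `std::vector<Request>`.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical structured request: processor ID, its parameters, upstream requests.
struct Request {
    std::string sourceId;         // identifier into the data source pool
    std::string params;           // parameters for this source only
    std::vector<Request> inputs;  // pipeline topology, encoded in the request
};

// Stateless processors, one per ID; data items are plain ints in this sketch.
using Processor = std::function<int(const std::string&, const std::vector<int>&)>;
using Pool = std::map<std::string, Processor>;

// Serializes the whole request tree, so equal keys mean equal computations.
std::string canonicalKey(const Request& r)
{
    std::string key = r.sourceId + "(" + r.params + ")[";
    for (const Request& in : r.inputs) key += canonicalKey(in) + ",";
    return key + "]";
}

// Resolves dependencies from the request alone and caches by canonical key.
int evaluate(const Request& r, const Pool& pool,
             std::map<std::string, int>& cache, int& computations)
{
    const std::string key = canonicalKey(r);
    auto hit = cache.find(key);
    if (hit != cache.end()) return hit->second;

    std::vector<int> inputs;
    for (const Request& in : r.inputs)
        inputs.push_back(evaluate(in, pool, cache, computations));

    const int item = pool.at(r.sourceId)(r.params, inputs);
    ++computations;
    cache[key] = item;
    return item;
}

// Two "actors" issue the same request; only the first one triggers computation.
int demoSharedComputation()
{
    Pool pool;
    pool["reader"] = [](const std::string& p, const std::vector<int>&) {
        return static_cast<int>(p.size());
    };
    pool["sum"] = [](const std::string&, const std::vector<int>& in) {
        int s = 0;
        for (int v : in) s += v;
        return s;
    };

    Request jetcore{"sum", "", {Request{"reader", "u", {}}, Request{"reader", "vv", {}}}};
    std::map<std::string, int> cache;
    int computations = 0;
    evaluate(jetcore, pool, cache, computations);  // computes two readers and the sum
    evaluate(jetcore, pool, cache, computations);  // served from the cache
    return computations;
}
```

Because the cache key is derived from the full request tree, the cached item cannot silently change meaning when some other part of the application is re-wired.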
