
Implement TransformationsManager

Previous title: Add default functions for storing TransformationDAG output into files or inside DataManager

When using the TransformationDAG only for computing data (and not for plotting), one has to piggy-back on the PlotManager and the plot_func to store the resulting data.

Two very useful defaults would be:

  1. Store all final nodes of the TransformationDAG in the original DataManager.
  2. Store all final nodes of the TransformationDAG as files in the out_dir (currently defined in the PlotManager cfg).

Proposal

As far as I can see, the TransformationDAG's intended use case is creating plots. However, if one wants to use it only for "data transformation", one still has to go through the PlotManager to make the data available outside the dantro ecosystem. Using the plot functions for storing data is cumbersome:

  1. When using simple plot functions without the decorator, the DataManager is not available and has to be piped through the TransformationDAG with its own node and tag.
  2. When extracting data from the DAG to add it to the DataManager or store it in a file, the tags have to be hard-coded. However, the TransformationDAG should already have the information on which nodes are the "final" ones. These could automatically be written to files, using the default writers for the respective data containers.
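
To illustrate the second point, here is a minimal sketch of how "final" nodes could be identified from the dependency structure alone. It does not use dantro's actual API: the DAG is represented by a plain dictionary mapping each tag to the tags it depends on, and the function name `find_terminal_tags` is hypothetical. The tags are taken from the example configuration further down in this issue.

```python
from typing import Dict, List, Set


def find_terminal_tags(dependencies: Dict[str, List[str]]) -> Set[str]:
    """Return all tags that no other node uses as an input."""
    used_as_input = {dep for deps in dependencies.values() for dep in deps}
    return set(dependencies) - used_as_input


# Hypothetical stand-in for the DAG of the example configuration:
# each tag maps to the list of tags it depends on.
example_dag = {
    "data": [],
    "mean": ["data"],
    "std": ["data"],
    "std_norm": ["std", "mean"],
}

print(sorted(find_terminal_tags(example_dag)))  # ['std_norm']
```

With this definition, `mean` and `std` are not "final" because `std_norm` consumes both of them, which is why intermediate results would need to be requested explicitly.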

Ideas for an implementation

(very unsure about this...)

Implement a lightweight StorageManager akin to the PlotManager. Configuration could look like this:

---
data_dir: ~/
data_manager:
  # ...

storage_manager:
  raise_exc: true
  out_dir: '{timestamp:}/'
  default_task: to_dm  # or 'to_file'

  # Need some way of specifying which nodes to store,
  # if not only the "terminal" ones.
  # NOTE: Not specifying nodes here would lead to only
  #       'std_norm' being stored in this example
  task_kwargs:
    store_nodes: [mean, std_norm]  

eval:
  task1:
    use_dag: true
    select:
      data: data  # From data_manager
    transform:
      - .mean: [!dag_tag data]
        kwargs: {axis: 0}
        tag: mean
      - .std: [!dag_tag data]
        kwargs: {axis: 0}
        tag: std
      - div: [!dag_tag std, !dag_tag mean]
        tag: std_norm
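
The node selection described in the `task_kwargs` comments could be resolved roughly as follows. This is only a sketch under assumptions: `resolve_store_nodes` and its parameters are hypothetical names, not part of dantro, and the available and terminal tags are passed in as plain lists rather than being read from a TransformationDAG.

```python
from typing import Iterable, List, Optional


def resolve_store_nodes(
    store_nodes: Optional[Iterable[str]],
    available_tags: Iterable[str],
    terminal_tags: Iterable[str],
) -> List[str]:
    """Choose the tags to store: explicit selection, else terminal nodes."""
    if store_nodes is None:
        # Nothing specified: fall back to the terminal nodes only
        return sorted(terminal_tags)

    available = set(available_tags)
    requested = list(store_nodes)
    unknown = [tag for tag in requested if tag not in available]
    if unknown:
        raise ValueError(f"Unknown DAG tags requested for storage: {unknown}")
    return requested


tags = ["data", "mean", "std", "std_norm"]
print(resolve_store_nodes(["mean", "std_norm"], tags, ["std_norm"]))
print(resolve_store_nodes(None, tags, ["std_norm"]))
```

The first call mirrors `store_nodes: [mean, std_norm]` from the configuration; the second shows the fallback in which only `std_norm` would be stored. Failing early on unknown tags is a design choice of this sketch, not something the proposal prescribes.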

Code:

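import dantro as dtr
from dantro import DataManager

dm = DataManager(data_dir, **cfg.get("data_manager", {}))
dm.load_from_cfg(load_cfg=cfg["data_manager"]["load_cfg"], print_tree=True)

# NOTE: StorageManager is the proposed class; it does not exist yet
sm = dtr.StorageManager(dm=dm, **cfg.get("storage_manager", {}))
sm.eval_from_cfg(eval_cfg=cfg.get("eval"))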

Afterwards, I would expect the dm to contain the two new nodes mean and std_norm:

Tree of DataManager '28947753', 4 members, 0 attributes
 └┬ data                        <XrDataContainer, float32, …
  └ task1                       <XrDataContainer, float32, …
    └┬ mean                     <XrDataContainer, float32, …
     └ std_norm                 <XrDataContainer, float32, …
Edited by Yunus Sevinchan