Resolve "Implement the `DataManager`"

This MR implements the DataManager class. It loads data from a fixed data_dir, as specified by a load configuration (load_cfg).

Main features:

  • Is associated with a data directory
  • Can create a time-stamped output directory inside the data directory
  • Can load data via a load configuration (see below for example configuration)
    • Can control where the loaded data is stored within the data manager
      • into a new group or container with the name of the configuration entry (default)
      • into a container with a specified name
      • into an existing or non-existing group (by path)
    • Ensures that data is not silently overwritten; needs to be explicitly configured to allow this
    • Gives a progress indicator
  • Can be easily extended by data loader mixin classes
  • Gives understandable error messages if something was badly configured or went wrong

Note that, to achieve this, quite a lot of procedural code and parameters were needed in the load method and its helpers.
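
To illustrate the overwrite protection listed above, here is a minimal sketch in plain Python. It is not the DataManager code: the `store` function, the `exists_action` parameter, and the `ExistingDataError` class are hypothetical names chosen for this example, and a plain dict stands in for a data group.

```python
class ExistingDataError(Exception):
    """Raised when loaded data would overwrite an existing entry (hypothetical)."""


def store(target: dict, name: str, data, *, exists_action: str = "raise") -> bool:
    """Store `data` under `name` in `target`; return True if it was stored.

    `exists_action` (hypothetical parameter) decides what happens on a name clash:
      "raise"     -- refuse to overwrite and raise an error (default)
      "skip"      -- keep the existing data and do nothing
      "overwrite" -- replace the existing data
    """
    if exists_action not in ("raise", "skip", "overwrite"):
        raise ValueError(f"Invalid exists_action: {exists_action!r}")

    if name in target:
        if exists_action == "raise":
            raise ExistingDataError(
                f"An entry named '{name}' already exists. Set exists_action "
                "to 'skip' or 'overwrite' to resolve the clash explicitly."
            )
        if exists_action == "skip":
            return False
        # exists_action == "overwrite": fall through and replace the entry

    target[name] = data
    return True
```

Making "raise" the default means a clash is never resolved silently; the caller has to opt in to either of the other two behaviours.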

Loader mixin classes

This MR implements two loader mixin classes:

  • YamlLoaderMixin: for loading YAML data
  • Hdf5LoaderMixin: for loading HDF5 data, recursively resolving h5.Group objects to DataGroups, and h5.Datasets to DataContainers, carrying over attributes
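
The general mixin-plus-decorator pattern could look roughly like the following sketch. All names here (`add_loader`, `BaseManager`, `JsonLoaderMixin`, `load_file`, the `_load_<name>` convention) are assumptions made for illustration and are not taken from the MR; JSON is used instead of YAML only to keep the example within the standard library.

```python
import json


def add_loader(*, target_cls):
    """Hypothetical decorator: declare a method as a load function and
    record which container type its result should be wrapped in."""
    def decorator(func):
        func.target_cls = target_cls
        return func
    return decorator


class JsonLoaderMixin:
    """Adds a 'json' loader to any manager class that inherits from it."""

    @add_loader(target_cls=dict)
    def _load_json(self, filepath):
        with open(filepath) as f:
            return json.load(f)


class BaseManager:
    """Stand-in for the manager; it knows nothing about concrete loaders."""

    def load_file(self, loader: str, filepath):
        # Look up the load function by name; mixins provide these methods.
        load_func = getattr(self, f"_load_{loader}", None)
        if load_func is None:
            raise ValueError(
                f"No loader named '{loader}' available. Is the corresponding "
                "loader mixin class missing from the class definition?"
            )
        # Wrap the raw data in the container type declared by the decorator.
        return load_func.target_cls(load_func(filepath))


class MyManager(JsonLoaderMixin, BaseManager):
    """A manager specialised purely by its list of mixins."""
```

With this layout, supporting a new file format only requires writing another mixin and adding it to the base classes of the specialised manager; the manager itself stays unchanged.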

Example load configuration

---
minimal:             
  loader: yaml       # yml also works
  glob_str: '*.yml'  
  # all yml files on the top level of the `data_dir` will be loaded
  # into the OrderedDataGroup /minimal

all_yaml:
  loader: yaml
  glob_str:
   - '*.yml'
   - '*.yaml'
   - '**/*.yml'
   - '**/*.yaml'
  always_create_group: true
  # all files in data_dir/ or its subdirectories that match any one of
  # the four glob strings will be loaded. The results will be stored
  # in the group /all_yaml, even if only a single file was found.

grouped_data:
  loader: yaml
  glob_str: 'group/*.yml'
  ignore: ['group/cfg.yml']
  always_create_group: true
  target_group: my_group
  # all yml files in data_dir/group (except group/cfg.yml) will be loaded
  # into /my_group, even if it is only one file

uni_data:
  loader: hdf5_proxy
  glob_str: 'universes/**/uni*_data.h5'
  path_regex: 'uni([0-9]+)_data.h5'
  # all hdf5 files matching the given pattern will be loaded into
  # the new group `/uni_data` under a regex-parsed name
  # Example: uni123_data.h5 will be accessible via /uni_data/123
  # Additionally, the data is loaded as proxy and will only be read if needed
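
How an entry's `glob_str`, `ignore`, and `path_regex` might be resolved into target names can be sketched with the standard library alone. This is an illustration under assumptions, not the DataManager code: the function name `resolve_entry` is hypothetical, the regex is assumed to contain exactly one capture group, and the file's base name is assumed as the fallback name when no regex is given.

```python
import glob
import os
import re


def resolve_entry(data_dir, glob_str, ignore=(), path_regex=None):
    """Return a dict mapping target names to file paths for one entry."""
    # Accept a single glob string or a list of them
    patterns = [glob_str] if isinstance(glob_str, str) else list(glob_str)

    # Collect matches in a set, so files matched by several patterns
    # appear only once. `recursive=True` enables the `**` wildcard.
    files = set()
    for pattern in patterns:
        for path in glob.glob(os.path.join(data_dir, pattern), recursive=True):
            files.add(os.path.normpath(path))

    # Remove explicitly ignored files (given relative to data_dir)
    files -= {os.path.normpath(os.path.join(data_dir, rel)) for rel in ignore}

    targets = {}
    for path in sorted(files):
        rel_path = os.path.relpath(path, data_dir)
        if path_regex is not None:
            match = re.search(path_regex, rel_path)
            if match is None:
                raise ValueError(
                    f"path_regex '{path_regex}' did not match '{rel_path}'"
                )
            name = match.group(1)
        else:
            # Assumed fallback: file name without its extension
            name = os.path.splitext(os.path.basename(path))[0]

        # Refuse to map two files onto the same name
        if name in targets:
            raise ValueError(
                f"Name clash: '{rel_path}' and "
                f"'{os.path.relpath(targets[name], data_dir)}' "
                f"both resolve to '{name}'"
            )
        targets[name] = path
    return targets
```

Note that `re.search` is applied to the path relative to the data directory, so a pattern such as `uni([0-9]+)_data\.h5` skips the `universes/` prefix and captures only the digits in the file name.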

Can this MR be accepted?

  • Implementation finished
    • DataManager class
    • Initialisation functions, directory creation, ...
    • load function for a single entry and helpers
    • load_from_cfg function that can load multiple entries
    • An interface to extend the class with loader mixin classes
    • A decorator to declare these functions
    • Useful loader mixins
    • Hdf5DataProxy and corresponding mixin class
  • Tests written
    • Full coverage of data_mngr module (except NotImplementedError and impossible-to-test cases)
    • Implement fixture to write output data that is then loaded in again as data_dir
    • Test that results are loaded into the desired location
    • Test that name clashes (upon existing data) are communicated clearly
    • YamlLoaderMixin tested
    • Hdf5LoaderMixin tested
    • Proxy data working
  • Pipeline passing
  • MR Description written

Related issues

Closes #3 (closed)

Edited by Utopia Developers
