Implement basic DataContainer

Description

As discussed in the MC use case (#113 (closed)) the ensembles in the Monte Carlo module should own a container that handles the data produced during an MC run. These data should be sufficient to repeat (data provenance) and restart the run. This means that the data container (DataContainer) class should handle

  • metadata (when the run was carried out, possibly by whom and where, name of ensemble, etc.)
  • run information (including the initial settings associated with the ensemble, atomic structure, setup of the PRNG, etc.)
  • data as a function of MC step
    • observables and parameters of the ensemble, such as energy, concentration
    • observables generated by observers, such as short-range order parameters, configurations

The class should be added to the mchammer/data_container.py module and has to provide functions that enable, e.g., the initialization of an Ensemble class:

class BaseEnsemble:
    def __init__(self, atoms, ...):
        ...
        self.data = DataContainer(atoms, ...)

class DataContainer:
    def __init__(self, atoms, ...):
        ...
        self.__structure = atoms.copy()

    @property
    def structure(self):
        return self.__structure.copy()

class SomeEnsemble(BaseEnsemble):
    def __init__(self, ...):
        ...
        super().__init__(...)  # call base class constructor
        self.data.add_parameter('temperature', 300.0)
        self.data.add_parameter('chemical-potential-difference', -0.5)
        self.data.add_observable('energy', float)

        # the duplication of temperature is intentional since temperature
        # is both a parameter that is needed for restarting and an observable
        # that could be included in the output stream
        self.data.add_observable('temperature', float)

        for obs in self.observers:
            # obs.property_type would probably be `list`
            self.data.add_observable(obs.tag, obs.property_type)
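
The registration step above can be sketched in a self-contained way as follows. This is only a minimal illustration under stated assumptions, not the intended implementation: the class name DataContainerSketch is hypothetical, a plain list stands in for the atomic structure, and raising ValueError on duplicate names is one possible way to satisfy the "not overwritten/doubly defined" requirement. Parameters and observables are kept in separate registries so that 'temperature' can appear in both.

```python
from collections import OrderedDict


class DataContainerSketch:
    """Hypothetical stand-in for DataContainer (registration only)."""

    def __init__(self, atoms):
        # keep a private copy so later changes to `atoms` do not leak in
        self._structure = atoms.copy()
        self._parameters = OrderedDict()
        self._observables = OrderedDict()

    @property
    def structure(self):
        # hand out a copy so callers cannot modify the stored structure
        return self._structure.copy()

    @property
    def parameters(self):
        return self._parameters.copy()

    @property
    def observables(self):
        return self._observables.copy()

    def add_parameter(self, tag, value):
        assert isinstance(tag, str), 'parameter tag must be a string'
        if tag in self._parameters:
            raise ValueError('parameter {} is already defined'.format(tag))
        self._parameters[tag] = value

    def add_observable(self, tag, obs_type):
        assert isinstance(tag, str), 'observable tag must be a string'
        assert isinstance(obs_type, type), 'obs_type must be a type'
        if tag in self._observables:
            raise ValueError('observable {} is already defined'.format(tag))
        self._observables[tag] = obs_type


# a list of symbols stands in for an atomic structure (assumption)
dc = DataContainerSketch(atoms=['Al', 'Al', 'Cu'])
dc.add_parameter('temperature', 300.0)
dc.add_observable('energy', float)
# allowed: parameters and observables live in separate registries
dc.add_observable('temperature', float)
```

Registering 'energy' a second time as an observable would raise ValueError in this sketch, while the deliberate temperature duplication across the two registries is accepted.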

During an MC run one would need to add data to the DataContainer object

self.data.append(self.mcstep, 'energy', energy)
if self.mcstep % self.minimum_interval == 0:
    for obs in self.observers:
        if self.mcstep % obs.interval == 0:
            self.data.append(self.mcstep, obs.tag, obs.get_observable(...))
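
One way the append call could behave is sketched below. The sketch assumes that data are stored as a mapping from MC step to a dictionary of tag/value pairs and that values are checked against the registered type; the class name AppendSketch, the attribute names, and the choice of exceptions are hypothetical. The sub-tasks call for pandas for the actual data structure, so the plain dictionaries here are only a stand-in.

```python
from collections import OrderedDict


class AppendSketch:
    """Hypothetical illustration of DataContainer.append."""

    def __init__(self, observables):
        # tag -> expected type, as registered via add_observable
        self._observables = OrderedDict(observables)
        # mcstep -> {tag: value}; tags missing at a step are simply absent
        self._rows = OrderedDict()

    def append(self, mcstep, tag, value):
        assert isinstance(mcstep, int) and mcstep >= 0, 'invalid MC step'
        if tag not in self._observables:
            raise KeyError('unknown observable: {}'.format(tag))
        expected = self._observables[tag]
        if not isinstance(value, expected):
            raise TypeError('{} expects {}'.format(tag, expected.__name__))
        self._rows.setdefault(mcstep, {})[tag] = value


dc = AppendSketch([('energy', float), ('sro', list)])
sro_interval = 10  # hypothetical observer interval
for mcstep in range(21):
    # the energy is recorded at every step
    dc.append(mcstep, 'energy', -100.0 + 0.1 * mcstep)
    # the observer is only evaluated at multiples of its interval
    if mcstep % sro_interval == 0:
        dc.append(mcstep, 'sro', [0.1, 0.2])
```

With this layout a step at which an observer was not evaluated has no entry for that tag, which is what later shows up as None when the data are retrieved.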

Note that ensemble parameters such as temperature or chemical potential would usually not be appended at every MC cycle but only when they are explicitly changed, e.g.,

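class SomeEnsemble(BaseEnsemble):
    @property
    def temperature(self):
        return self._temperature

    @temperature.setter
    def temperature(self, temperature):
        # store in a private attribute to avoid calling the setter recursively
        self._temperature = temperature
        self.data.append(self.mcstep, 'temperature', temperature)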

During analysis one would require the following functions

dc = DataContainer(...)
print(dc.parameters)
>> OrderedDict([('temperature', 300.0), ('chemical-potential-difference', -0.5)])

print(dc.observables)
>> OrderedDict([('energy', float), ('temperature', float), ('sro', list), ...])

data = dc.get_data(['mcstep', 'energy', 'temperature', 'sro1'],
                   interval=..., filter=..., fill_missing=False)
print(data)
>> [[0, -100.0, 500.0, 0.1],
    [10, -99.1, None, 0.2],
    [20, -98.7, None, 0.15],
    ...
    [500, -97.8, 200.0, -0.2],
    [510, -98.1, None, -0.22],
    ...
    [1000, -99.8, None, -0.4]]

The fill_missing option controls how missing elements (None in the example above) are treated. With fill_missing=True each missing element should be replaced by the most recent preceding value of the same column, which leads to [500.0, 500.0, 500.0, 500.0, ..., 200.0, 200.0, 200.0, ... 200.0] for the temperature column in the example above.
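
The fill_missing behavior can be illustrated with a small plain-Python sketch. It assumes the data are available as a mapping from MC step to tag/value pairs; the free function get_data and the rows layout are hypothetical, and the interval and filter options are left out. In a pandas-based implementation the same forward fill could presumably be obtained with DataFrame.ffill().

```python
def get_data(rows, tags, fill_missing=False):
    """Return one record per MC step with the requested columns.

    rows: dict mapping mcstep -> {tag: value} (hypothetical layout)
    tags: column names; 'mcstep' selects the step itself
    """
    last_seen = {}  # tag -> most recent value that was not missing
    data = []
    for mcstep in sorted(rows):
        record = []
        for tag in tags:
            if tag == 'mcstep':
                record.append(mcstep)
                continue
            value = rows[mcstep].get(tag)
            if value is None and fill_missing:
                # forward fill: reuse the last recorded value, if any
                value = last_seen.get(tag)
            elif value is not None:
                last_seen[tag] = value
            record.append(value)
        data.append(record)
    return data


# values taken from the example above
rows = {
    0: {'energy': -100.0, 'temperature': 500.0},
    10: {'energy': -99.1},
    20: {'energy': -98.7},
    500: {'energy': -97.8, 'temperature': 200.0},
    510: {'energy': -98.1},
}
raw = get_data(rows, ['mcstep', 'temperature'])
filled = get_data(rows, ['mcstep', 'temperature'], fill_missing=True)
```

Here the temperature column of raw is [500.0, None, None, 200.0, None], whereas the same column of filled is [500.0, 500.0, 500.0, 200.0, 200.0].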

Notes

  • ensure that fields (parameter, observable names) are not overwritten/doubly defined
  • use assertions wherever possible/appropriate

Sub-tasks

  • define interface and functions (using pass)
  • implement data structure (using pandas)
  • add complete unit tests

Please note that read/write and restart functionalities are assigned to separate issues.

Demonstration

  • tests pass
  • doc strings complete
Edited by William Armando Munoz